Preface


This guide describes efficient methods for shared-memory programming
using the following HP-UX compilers: HP Fortran, HP aC++ (ANSI C++),
and HP C.

The Parallel Programming Guide for HP-UX is intended for use by
experienced Fortran, C, and C++ programmers. This guide describes the
enhanced features of HP-UX 11.0 compilers on single-node
multiprocessor HP technical servers. These enhancements include new
loop optimizations and constructs for creating programs to run
concurrently on multiple processors.

You need not be familiar with the HP parallel architecture, programming 
models, or optimization concepts to understand the concepts introduced 
in this book. 

Scope 

This guide covers programming methods for the following HP compilers
on V2200, V2250, and K-Class machines running HP-UX 11.0 and
higher:

• HP Fortran Version 2.0 (and higher)

• HP aC++ Version 3.0 (and higher)

• HP C Version 1.2.3 (and higher)

The HP compilers now support an extensive shared-memory 
programming model. HP-UX 11.0 and higher includes the required 
assembler, linker, and libraries. 

This guide describes how to produce programs that efficiently exploit the 
features of HP parallel architecture concepts and the HP compiler set. 
Producing efficient programs requires the use of efficient algorithms and 
implementation. The techniques of writing an efficient algorithm are 
beyond the scope of this guide. It is assumed that you have chosen the 
best possible algorithm for your problem. This manual should help you 
obtain the best possible performance from that algorithm.

Notational conventions

This section discusses notational conventions used in this book. 


bold monospace

In command examples, bold monospace identifies input that must be typed
exactly as shown.

monospace

In paragraph text, monospace identifies command names, system calls, and
data structures and types.

In command examples, monospace identifies command output, including
error messages.

italic

In paragraph text, italic identifies titles of documents.

In command syntax diagrams, italic identifies variables that you must
provide.

The following command example uses brackets to indicate that the variable
output_file is optional:

command input_file [output_file]

Brackets ([ ])

In command examples, square brackets designate optional entries.

Curly brackets ({}), Pipe (|)

In command syntax diagrams, text surrounded by curly brackets indicates a
choice. The choices available are shown inside the curly brackets and
separated by the pipe sign (|).

The following command example indicates that you can enter either a or b:

command {a | b}

Horizontal ellipses

In command examples, horizontal ellipses show repetition of the preceding
items.

Vertical ellipses

Vertical ellipses show that lines of code have been left out of an example.

Keycap

Keycap indicates the keyboard keys you must press to execute the command
example.

NOTE

A Note highlights important supplemental information.

The directives and pragmas described in this book can be used with the
Fortran and C compilers, unless otherwise noted. The aC++ compiler
does not support the pragmas, but does support the memory classes.

In general discussion, these directives and pragmas are presented in
lowercase type, but each compiler recognizes them regardless of their
case.

References to man pages appear in the form mnpgname(1), where
"mnpgname" is the name of the man page and is followed by its section
number enclosed in parentheses. To view this man page, type:

% man 1 mnpgname


Command syntax 

Consider this example: 

command input_file [...] {a | b} [output_file]

• command must be typed as it appears.

• input_file indicates a file name that must be supplied by the user.

• The horizontal ellipsis in brackets indicates that additional, optional 
input file names may be supplied. 

• Either a or b must be supplied. 

• [output_file] indicates an optional file name.

Associated documents

The following documents are listed as additional resources to help you
use the compilers and associated tools:

• HP Fortran Programmer's Guide. Provides extensive usage
information (including how to compile and link), suggestions and
tools for migrating to HP Fortran, and how to call C and HP-UX
routines for HP Fortran 90.

• HP Fortran Programmer's Reference. Presents complete Fortran 90
language reference information. It also covers compiler options,
compiler directives, and library information.

• HP aC++ Online Programmer's Guide. Presents reference and
tutorial information on aC++. This manual is only available in HTML
format.

• HP MPI User's Guide. Discusses message-passing programming
using Hewlett-Packard's Message-Passing Interface library.

• Programming with Threads on HP-UX. Discusses programming
with POSIX threads.

• HP C/HP-UX Reference Manual. Presents reference information on
the C programming language, as implemented by HP.

• HP C/HP-UX Programmer's Guide. Contains detailed discussions of
selected C topics.

• HP-UX Linker and Libraries User's Guide. Describes how to develop
software on HP-UX, using the HP compilers, assemblers, linker,
libraries, and object files.

• Managing Systems and Workgroups. Describes how to perform
various system administration tasks.

• Threadtime by Scott J. Norton and Mark D. DiPasquale. Provides
detailed guidelines on the basics of thread management, including
POSIX thread structure; thread management functions; and the
creation, termination, and synchronization of threads.

• HP MLIB User's Guide: VECLIB and LAPACK. Provides usage
information about mathematical software and computational kernels
for engineering and scientific applications.


Introduction


Hewlett-Packard compilers generate efficient parallel code with little
user intervention. However, you can increase this efficiency by using the
techniques discussed in this book.

This chapter contains a discussion of the following topics:

• HP SMP architectures 

• Parallel programming model 

• Overview of HP optimizations

HP SMP architectures

Hewlett-Packard offers single-processor and symmetric multiprocessor
(SMP) systems. This book focuses on SMP systems, specifically, those
that utilize different bus configurations for memory access. These are
briefly described in the following sections, and in more detail in the
"Architecture overview" section.

Bus-based systems 

The K-Class servers are midrange servers with a bus-based architecture.
Each contains one set of processors and physical memory. Memory is shared
among all the processors, with a bus serving as the interconnect. The
shared-memory architecture has a uniform access time from each
processor.

Hyperplane Interconnect systems

V-Class server configurations range from one to 16 processors on
a single-node system. These systems have the following
characteristics:

• Processors communicate with each other, with memory, and with I/O
devices through a Hyperplane Interconnect nonblocking crossbar.

• Scalable physical memory. The current V-Class servers support up to
16 Gbytes of memory.

• Each process on an HP system can access a 16-terabyte (Tbyte)
virtual address space.

Parallel programming model

Parallel programming models provide perspectives from which you can
write, or adapt, code to run on a high-end HP system. You can perform
both shared-memory programming and message-passing programming
on an SMP. This book focuses on using the shared-memory paradigm, but
includes reference material and pointers to other manuals about
message passing.

The shared-memory paradigm 

In the shared-memory paradigm, compilers handle optimizations, and, if
requested, parallelization. Numerous compiler directives and pragmas
are available to further increase optimization opportunities.
Parallelization can also be specified using POSIX threads (Pthreads).
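As an illustration of the Pthreads route, the following C sketch splits a summation loop across a fixed number of threads that all read the same shared array. It is not taken from the HP compilers or libraries: the names parallel_sum, sum_chunk, chunk_t, and NUM_THREADS are hypothetical, and the thread count of four is an arbitrary assumption.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4   /* hypothetical thread count chosen for illustration */

/* Work descriptor for one thread: a half-open index range and its partial sum. */
typedef struct {
    const double *data;
    size_t begin;
    size_t end;
    double partial;
} chunk_t;

/* Thread body: sum this thread's slice of the shared array. */
static void *sum_chunk(void *arg)
{
    chunk_t *c = (chunk_t *)arg;
    c->partial = 0.0;
    for (size_t i = c->begin; i < c->end; i++)
        c->partial += c->data[i];
    return NULL;
}

/* Sum n doubles using NUM_THREADS threads over shared memory. */
double parallel_sum(const double *data, size_t n)
{
    pthread_t tid[NUM_THREADS];
    chunk_t chunk[NUM_THREADS];
    int created[NUM_THREADS];
    double total = 0.0;

    assert(data != NULL || n == 0);

    for (int t = 0; t < NUM_THREADS; t++) {
        chunk[t].data = data;
        chunk[t].begin = n * (size_t)t / NUM_THREADS;
        chunk[t].end = n * (size_t)(t + 1) / NUM_THREADS;
        created[t] = (pthread_create(&tid[t], NULL, sum_chunk, &chunk[t]) == 0);
        if (!created[t])
            sum_chunk(&chunk[t]);   /* fall back to running this slice serially */
    }
    for (int t = 0; t < NUM_THREADS; t++) {
        if (created[t])
            pthread_join(tid[t], NULL);
        total += chunk[t].partial;  /* combine after the thread has finished */
    }
    return total;
}
```

Each thread writes only its own chunk_t, so no locking is needed; the partial sums are combined after the joins. Calling parallel_sum(a, n) from a single-threaded program is enough to exercise it.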

Figure 1 shows the SMP model for the shared-memory paradigm.

Figure 1 Symmetric multiprocessor system

The directives and pragmas associated with the shared-memory
programming model are discussed in the chapters titled "Parallel
programming techniques," "Memory classes," and "Parallel
synchronization."

The message-passing paradigm 

HP has implemented a version of the message-passing interface (MPI)
standard known as HP MPI. This implementation is finely tuned for HP
technical servers.

In message passing, a parallel application consists of a number of
processes that run concurrently. Each process has its own local memory.
It communicates with other processes by sending and receiving
messages. When data is passed in a message, both processes must work
to transfer the data from the local memory of one to the local memory of
the other.

Under the message-passing paradigm, functions allow you to explicitly
spawn parallel processes, communicate data among them, and
coordinate their activities. Unlike the previous model, there is no shared
memory. Each process has its own private 16-terabyte (Tbyte) address
space, and any data that must be shared must be explicitly passed
between processes. Figure 2 shows a layout of the message-passing
paradigm.

Figure 2 Message-passing programming model (distributed memory model)

Support of message passing allows programs written under this
paradigm for distributed memory to be easily ported to HP servers.
Programs that require more per-process memory than possible using
shared memory benefit from the manually tuned message-passing style.

For more information about HP MPI, see the HP MPI User's Guide and
the MPI Reference.

Overview of HP optimizations

HP compilers perform a range of user-selectable optimizations. These 
new and standard optimizations, specified using compiler command-line 
options, are briefly introduced here. A more thorough discussion, 
including the features associated with each, is provided in "Optimization 
levels," on page 25. 

Basic scalar optimizations 

Basic scalar optimizations improve performance at the basic block and 
program unit level. 

A basic block is a sequence of statements that has a single entry point
and a single exit. Branches do not exist within the body of a basic block.
A program unit is a subroutine, function, or main program in Fortran or
a function (including main) in C and C++. Program units are also
generically referred to as procedures. Basic blocks are contained within
program units. Optimizations at the program unit level span basic
blocks.

To improve performance, basic optimizations perform the following
activities:

• Exploit the processor's functional units and registers

• Reduce the number of times memory is accessed

• Simplify expressions 

• Eliminate redundant operations 

• Replace variables with constants 

• Replace slow operations with faster equivalents 
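To make a few of these activities concrete, the following C sketch shows the kind of rewrite they amount to, applied by hand. It is only an illustration: the function names are hypothetical, and nothing here describes how the HP compilers implement these optimizations internally.

```c
#include <assert.h>
#include <stddef.h>

/* Before: scale * 2.0 is evaluated twice in every iteration. */
void scale_naive(double *out, const double *in, size_t n, double scale)
{
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * (scale * 2.0) + (scale * 2.0);
}

/* After: the redundant operation is eliminated and the value is computed
   once outside the loop, where it can stay in a register. */
void scale_tuned(double *out, const double *in, size_t n, double scale)
{
    const double k = scale * 2.0;
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * k + k;
}

/* A slow operation and a faster equivalent: multiplying an index by 8 ... */
unsigned long byte_offset_mul(unsigned long i)
{
    return i * 8UL;
}

/* ... gives the same result as shifting it left by 3 bits. */
unsigned long byte_offset_shift(unsigned long i)
{
    return i << 3;
}
```

Both versions of each pair compute the same values; only the amount of work per iteration differs. An optimizing compiler is generally expected to make such rewrites itself, so writing them by hand is rarely necessary.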

Advanced scalar optimizations

Advanced scalar optimizations are primarily intended to maximize data
cache usage. This is referred to as data localization. Concentrating on
loops, these optimizations strive to encache the data most frequently
used by the loop and keep it encached so as to avoid costly memory
accesses.

Advanced scalar optimizations include several loop transformations.
Many of these optimizations either facilitate more efficient strip mining
or are performed on strip-mined loops to optimize processor data cache
usage. All of these optimizations are covered in "Controlling
optimization," on page 113.

Advanced scalar optimizations implicitly include all basic scalar 
optimizations. 
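Strip mining itself is a simple transformation: a single loop becomes an outer loop over fixed-length strips and an inner loop within each strip. The following C sketch applies it by hand to a summation. The names and the strip length of 64 are assumptions for illustration; a real strip length would be chosen to suit the data cache.

```c
#include <assert.h>
#include <stddef.h>

#define STRIP 64   /* hypothetical strip length; not an HP-specified value */

/* Original loop. */
double sum_plain(const double *a, size_t n)
{
    double s = 0.0;
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Strip-mined loop: same iterations in the same order, grouped into strips. */
double sum_strip_mined(const double *a, size_t n)
{
    double s = 0.0;
    assert(a != NULL || n == 0);
    for (size_t is = 0; is < n; is += STRIP) {             /* loop over strips */
        size_t end = (n - is < STRIP) ? n : is + STRIP;    /* last strip may be short */
        for (size_t i = is; i < end; i++)                  /* loop within a strip */
            s += a[i];
    }
    return s;
}
```

On its own the strip-mined form does no less work. Its value is that the strips become units that other transformations can reorder or hand to different processors while each strip's data is still encached.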

Parallelization 

HP compilers automatically locate and exploit loop-level parallelism in
most programs. Using the techniques described in "Parallel
programming techniques," on page 175, you can help the compilers find
even more parallelism in your programs.

Loops that have been data-localized are prime candidates for
parallelization. Individual iterations of loops that contain strips of
localizable data are parcelled out among several processors and run
simultaneously. The maximum number of processors that
can be used is limited by the number of iterations of the loop and by
processor availability.

While most parallelization is done on nested, data-localized loops, other 
code can also be parallelized. For example, through the use of manually 
inserted compiler directives, sections of code outside of loops can also be 
parallelized. 

Parallelization optimizations implicitly include both basic and advanced
scalar optimizations.

Architecture overview

This chapter provides an overview of Hewlett-Packard's shared-memory
K-Class and V-Class architectures. The information in this chapter
focuses on these architectures as they relate to parallel programming.

For more information on the family of V-Class servers, see the
V-Class Architecture manual.

System architectures

PA-RISC processors communicate with each other, with memory, and
with peripherals through various bus configurations. The difference
between the K-Class and V-Class servers lies in the manner in
which they access memory. The K-Class maintains a bus-based
configuration, shown in Figure 3.

Figure 3 K-Class bus configuration (processor-memory bus)




On a V-Class, processors communicate with each other, memory, and
peripherals through a nonblocking crossbar. The V-Class implementation
is achieved through the Hyperplane Interconnect, shown in Figure 4.

The HP V2250 server has one to 16 PA-8200 processors and 256 Mbytes
to 16 Gbytes of physical memory. Two CPUs and a PCI bus share a single
CPU agent. The CPUs communicate with the rest of the machine
through the CPU agent. The Memory Access Controllers (MACs) provide
the interface between the memory banks and the rest of the machine.

CPUs communicate directly with their own instruction and data caches,
which are accessed by the processor in one clock (assuming a full
pipeline). V2250 servers use 2-Mbyte off-chip instruction caches and data
caches.

Figure 4 V2250 Hyperplane Interconnect view

PCI: PCI Bus Controller

Data caches

HP systems use cache to enhance performance. Cache sizes, as well as 
cache line sizes, vary with the processor used. Data is moved between the 
cache and memory using cache lines. A cache line describes the size of a 
chunk of contiguous data that must be copied into or out of a cache in one 
operation. 

When a processor experiences a cache miss (that is, it requests data that
is not already encached), the cache line containing the address of the
requested data is moved to the cache. This cache line also contains a
number of other data objects that were not specifically requested.

One reason cache lines are employed is to allow for data reuse. Data in a 
cache line is subject to reuse if, while the line is encached, any of the data 
elements contained in the line besides the originally requested element 
are referenced by the program, or if the originally requested element is 
referenced more than once. 

Because data can only be moved to and from memory as part of a cache 
line, both load and store operations cause their operands to be encached. 
Cache-coherency hardware, as found on a V2250, invalidates cache lines
in other processors when they are stored to by a particular processor.
This indicates to other processors that they must load the cache line from
memory the next time they reference its data.

Data alignment 

Aligning data addresses on cache line boundaries allows for efficient
data reuse in loops (refer to "Data reuse," on page 71). The linker
automatically aligns data objects larger than 32 bytes in size on
a 32-byte boundary. It also aligns data greater than a page size on a
64-byte boundary.

Only the first item in a list of data objects appearing in any of these
statements is aligned on a cache line boundary. To make the most
efficient use of available memory, the total size, in bytes, of any array
appearing in one of these statements should be an integral multiple
of 32.
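The arithmetic behind the multiple-of-32 rule can be sketched in C as follows. The helper names are hypothetical; the only value taken from the text is the 32-byte boundary, and the element sizes are assumed to divide 32 evenly (as 4-byte and 8-byte reals do).

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 32u   /* alignment boundary discussed in the text */

/* Round a size in bytes up to the next multiple of 32. */
size_t padded_bytes(size_t total_bytes)
{
    return (total_bytes + LINE_BYTES - 1u) / LINE_BYTES * LINE_BYTES;
}

/* Number of extra elements to declare so that an array of n_elems elements
   of elem_size bytes occupies an integral multiple of 32 bytes. */
size_t extra_elems(size_t n_elems, size_t elem_size)
{
    size_t bytes;
    assert(elem_size > 0 && LINE_BYTES % elem_size == 0);
    bytes = n_elems * elem_size;
    return (padded_bytes(bytes) - bytes) / elem_size;
}
```

For example, a real*8 array of 10 elements occupies 80 bytes, so extra_elems(10, 8) returns 2: declaring 12 elements brings the array to 96 bytes and keeps whatever follows it aligned.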

Sizing your arrays this way prevents data following the first array from
becoming misaligned. Scalar variables should be listed after arrays and
ordered from longest data type to shortest. For example, real*8 scalars
should precede real*4 scalars.

You can align data on 64-byte boundaries by doing the following. These
apply only to parallel executables:

• Using Fortran allocate statements

• Using the C functions malloc or memory_class_malloc

Aliases can inhibit data alignment. Be careful when equivalencing arrays in
Fortran.

Cache thrashing 

Cache thrashing occurs when two or more data items that are frequently
needed by the program map to the same cache address. Each time
one of the items is encached, it overwrites another needed item, causing
cache misses and impairing data reuse. This section explains how
thrashing happens on the V-Class.

A type of thrashing known as false cache line sharing is discussed in the
section "False cache line sharing" on page 279.

The following Fortran example illustrates cache thrashing:

      REAL*8 ORIG(131072), NEW(131072), DISP(131072)
      COMMON /BLK1/ ORIG, NEW, DISP

      DO I = 1, N
        NEW(I) = ORIG(I) + DISP(I)
      ENDDO

In this example, the arrays orig and disp overwrite each other in
a 2-Mbyte cache. Because the arrays are in a common block, they are
allocated in contiguous memory in the order shown. Each array element
occupies 8 bytes, so each array occupies one Mbyte (8 x 131072 = 1048576
bytes). Therefore, arrays orig and disp are exactly 2 Mbytes apart in
memory, and all their elements have identical cache addresses. The
layout of the arrays in memory and in the data cache is shown in
Figure 5.
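The address arithmetic can be checked with a short C sketch. It assumes a direct-mapped 2-Mbyte cache in which the cache address is simply the byte offset modulo the cache size, and it measures offsets from the start of the common block; the function names are hypothetical.

```c
#include <assert.h>

#define ELEM_BYTES   8UL                      /* REAL*8 */
#define ARRAY_ELEMS  131072UL                 /* elements per array */
#define CACHE_BYTES  (2UL * 1024UL * 1024UL)  /* 2-Mbyte data cache */

/* Byte offset, from the start of the common block, of element i (1-based)
   of the array in the given slot: 0 = orig, 1 = new, 2 = disp. */
unsigned long common_offset(unsigned long slot, unsigned long i)
{
    assert(slot <= 2UL && i >= 1UL && i <= ARRAY_ELEMS);
    return (slot * ARRAY_ELEMS + (i - 1UL)) * ELEM_BYTES;
}

/* Cache address under the assumed direct-mapped placement. */
unsigned long cache_address(unsigned long byte_offset)
{
    return byte_offset % CACHE_BYTES;
}
```

With these definitions, cache_address(common_offset(0, i)) equals cache_address(common_offset(2, i)) for every i, which is the collision between orig and disp described above, while new sits 1 Mbyte away in the cache and does not collide with either.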

Figure 5 Array layouts: cache thrashing

When the addition in the body of the loop executes, the current elements
of both orig and disp must be fetched from memory into the cache.
Because these elements have identical cache addresses, whichever is
fetched last overwrites the first. Processor cache data is fetched 32 bytes
at a time.

To efficiently execute a loop such as this, the unused elements in the
fetched cache line (three extra real*8 elements are fetched in this case)
must remain encached until they are used in subsequent iterations of the
loop. Because orig and disp thrash each other, this reuse is never
possible. Every cache line of orig that is fetched is overwritten by the
cache line of disp that is subsequently fetched, and vice versa. The
cache line is overwritten on every iteration. Typically, in a loop like this,
it would not be overwritten until all of its elements were used.

Memory accesses take substantially longer than cache accesses, which 
severely degrades performance. Even if the overwriting involved the new 
array, which is stored rather than loaded on each iteration, thrashing 
would occur, because stores overwrite entire cache lines the same way 
loads do. 

The problem is easily fixed by increasing the distance between the
arrays. You can accomplish this by either increasing the array sizes or
inserting a padding array.

Cache padding

The following Fortran example illustrates cache padding:

      REAL*8 ORIG(131072), NEW(131072), P(4), DISP(131072)
      COMMON /BLK1/ ORIG, NEW, P, DISP

In this example, the array p(4) moves disp 32 bytes further from orig
in memory. No two elements of the same index share a cache address.
This postpones cache overwriting for the given loop until the entire
current cache line is completely exploited.

The alternate approach involves increasing the size of orig or new by 4
elements (32 bytes), as shown in the following example:

      REAL*8 ORIG(131072), NEW(131076), DISP(131072)
      COMMON /BLK1/ ORIG, NEW, DISP

Here, new has been increased by 4 elements, providing the padding
necessary to prevent orig from sharing cache addresses with disp.
Figure 6 shows how both solutions prevent thrashing.

Figure 6 Array layouts: non-thrashing
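The effect of the padding can be confirmed by counting, for each loop iteration, whether orig(i) and disp(i) land on the same cache line. The C sketch below does this under the same assumptions as the discussion above (a direct-mapped 2-Mbyte cache with 32-byte lines, offsets measured from the start of the common block); count_conflicts and its parameters are hypothetical names.

```c
#include <assert.h>

#define ELEM_BYTES   8UL                                  /* REAL*8 */
#define ARRAY_ELEMS  131072UL                             /* elements per array */
#define LINE_BYTES   32UL                                 /* cache line size */
#define CACHE_LINES  (2UL * 1024UL * 1024UL / LINE_BYTES) /* lines in a 2-Mbyte cache */

/* Index of the cache line that a byte offset maps to (direct-mapped model). */
static unsigned long cache_line(unsigned long byte_offset)
{
    return (byte_offset / LINE_BYTES) % CACHE_LINES;
}

/* Count iterations i = 1..n in which orig(i) and disp(i) share a cache line
   when pad_elems REAL*8 elements of padding separate them in addition to new. */
unsigned long count_conflicts(unsigned long pad_elems, unsigned long n)
{
    unsigned long conflicts = 0;
    assert(n <= ARRAY_ELEMS);
    for (unsigned long i = 1; i <= n; i++) {
        unsigned long orig_off = (i - 1UL) * ELEM_BYTES;
        unsigned long disp_off =
            (2UL * ARRAY_ELEMS + pad_elems + (i - 1UL)) * ELEM_BYTES;
        if (cache_line(orig_off) == cache_line(disp_off))
            conflicts++;
    }
    return conflicts;
}
```

Without padding, count_conflicts(0, 131072) reports a conflict on every iteration; with the four-element pad, count_conflicts(4, 131072) reports none. A pad smaller than one cache line removes only some of the conflicts, which is why the pad is a full 32 bytes.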

It is important to note that this is a highly simplified, worst-case
example.

Loop blocking optimization (described in "Loop blocking" on page 70)
eliminates thrashing from certain nested loops, but not from all loops.
Declaring arrays with dimensions that are not powers of two can help,
but it does not completely eliminate the problem.

Using common blocks in Fortran can also help because it allows you to
accurately measure distances between data items, making thrashing
problems easier to spot before they happen.

Memory Systems

HP's K-Class and V-Class servers maintain a single level of memory
latency. Memory functions and interleaving work similarly on both
servers, as described in the following sections.

Physical memory

Multiple, independently accessible memory banks are available on both
the K-Class and V-Class servers. In 16-processor V2250 servers, for
example, each node consists of up to 32 memory banks. This memory is
typically partitioned (by the system administrator) into system-global
memory and buffer cache. It is also interleaved as described in
"Interleaving" on page 18. The K-Class architecture supports up to four
memory banks.

System-global memory is accessible by all processors in a given system.
The buffer cache is a file system cache and is used to encache items that
have been read from disk and items that are to be written to disk.

Memory interleaving is used to improve performance. For an
explanation, see "Interleaving" on page 18.

Virtual memory

Each process running on a V-Class or K-Class server under
HP-UX accesses its own 16-Tbyte virtual address space. Almost all of
this space is available to hold program text, data, and the stack. The
space used by the operating system is negligible.

The memory stack size is configurable. Refer to the section "Setting
thread default stack size" on page 202 for more information.

Both servers share data among all threads unless a variable is declared
to be thread private. Memory class definitions describing data
disposition across hypernodes have been retained for the V-Class. This is
primarily for potential use when porting to multinode machines.

thread_private

This memory is private to each thread of a process. A
thread_private data object has a unique virtual
address for each thread. These addresses map to
unique physical addresses in hypernode-local physical
memory.

node_private

This memory is shared among the threads of a process
running on a single node. Since the V-Class and
K-Class servers are single-node machines,
node_private actually serves as one common shared
memory class.

Memory classes are discussed more fully in "Memory classes," on
page 233.

Processes cannot access each other's virtual address spaces. This virtual 
memory maps to the physical memory of the system on which the process 
is running. 

Interleaving 

Physical pages are interleaved across the memory banks on a cache-line
basis. There are up to 32 banks in the V2250 servers; there are up to four
on a K-Class. Contiguous cache lines are assigned in round-robin
fashion, first to the even banks, then to the odd, as shown in Figure 7 for
V2250 servers.

Interleaving speeds memory accesses by allowing several processors to
access contiguous data simultaneously. It also eliminates busy bank and
board waits for unit stride accesses. This is beneficial when a loop that
manipulates arrays is split among many processors. In the best case,
threads access data in patterns with no bank contention. Even in the
worst case, in which each thread initially needs the same data from the
same bank, after the initial contention delay, the accesses are spread out
among the banks.

Figure 7 V2250 interleaving

The following Fortran example illustrates a nested loop that accesses
memory with very little contention. This example is greatly simplified for
illustrative purposes, but the concepts apply to arrays of any size.

      REAL*8 A(12,12), B(12,12)

      DO J = 1, N
        DO I = 1, N
          A(I,J) = B(I,J)
        ENDDO
      ENDDO

Assume that arrays a and b are stored contiguously in memory, with a
starting in bank 0, processor cache line 0 for V2250 servers, as shown in
Figure 8 on page 22.

Assume that the HP Fortran compiler parallelizes the j loop to
run on as many processors as are available in the system (up to N).
Assuming N = 12 and there are four processors available when the
program is run, the j loop could be divided into four new loops, each with
3 iterations. Each new loop would run to completion on a separate
processor. These four processors are identified as CPU 0 through CPU 3.

This example is designed to simplify illustration. In reality, the dynamic
selection optimization (discussed in "Dynamic selection" on page 102)
would, given the iteration count and available number of processors
described, cause this loop to run serially. The overhead of going parallel
would outweigh the benefits.

In order to execute the body of the i loop, a and b must be fetched from
memory and encached. Each of the four processors running the j loop
attempts to simultaneously fetch its portion of the arrays.

This means CPU 0 will attempt to read arrays a and b starting at
elements (1,1), CPU 1 will attempt to start at elements (1,4), and so
on.

Because of the number of memory banks in the V2250 architecture,
interleaving removes the contention from the beginning of the loop in
the example, as shown in Figure 8.

• CPU 0 needs A(1:12,1:3) and B(1:12,1:3)

• CPU 1 needs A(1:12,4:6) and B(1:12,4:6)

• CPU 2 needs A(1:12,7:9) and B(1:12,7:9)

• CPU 3 needs A(1:12,10:12) and B(1:12,10:12)

The data from the V2250 example above is spread out on different
memory banks as described below:

• a(1,1), the first element of the chunk needed by CPU 0, is on cache
line 0 in bank 0 on board 0

• a(1,4), the first element needed by CPU 1, is on cache line 9 in bank
1 on board 1

• a(1,7), the first element needed by CPU 2, is on cache line 18 in
bank 2 on board 2

• a(1,10), the first element needed by CPU 3, is on cache line 27 in
bank 3 on board 3
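The cache line numbers in this list follow from the array layout alone. The C sketch below reproduces them, assuming Fortran column-major storage, 8-byte elements, 32-byte cache lines, and array a starting at cache line 0. The function names are hypothetical, and the sketch deliberately does not model the bank and board assignment, which depends on hardware details not given here.

```c
#include <assert.h>

#define N           12   /* array extent and trip count from the example */
#define ELEM_BYTES  8    /* REAL*8 */
#define LINE_BYTES  32   /* processor cache line size */

/* First column (1-based j) handled by a CPU when the j loop is divided
   evenly among ncpus processors. */
int first_column(int cpu, int ncpus)
{
    assert(ncpus > 0 && N % ncpus == 0 && cpu >= 0 && cpu < ncpus);
    return cpu * (N / ncpus) + 1;
}

/* Cache line, counted from the start of array a, that holds a(i,j). */
long cache_line_of(int i, int j)
{
    long elem;
    assert(i >= 1 && i <= N && j >= 1 && j <= N);
    elem = (long)(j - 1) * N + (i - 1);   /* column-major element index */
    return elem * ELEM_BYTES / LINE_BYTES;
}
```

Each column of a holds 12 x 8 = 96 bytes, or three cache lines, so the chunks for CPU 0 through CPU 3 start at columns 1, 4, 7, and 10 and at cache lines 0, 9, 18, and 27.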

Because of interleaving, no contention exists between the processors
when trying to read their respective portions of the arrays. Contention
may surface occasionally as the processors make their way through the
data, but the resulting delays are minimal compared to what could be
expected without interleaving.

Figure 8 V2250 interleaving of arrays a and b

Variable-sized pages on HP-UX

Variable-sized pages are used to reduce Translation Lookaside Buffer 
(TLB) misses, improving performance. A TLB is a hardware entity used 
to hold a virtual to physical address translation. With variable-sized 
pages, each TLB entry used can map a larger portion of an application's 
virtual address space. Thus, applications with large data sets are 
mapped using fewer TLB entries, resulting in fewer TLB misses. 
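The relationship between page size and TLB usage is simple division, as the following C sketch shows. It is a rough model only: it assumes the whole data set is mapped with a single page size, and tlb_entries_needed is a hypothetical helper, not an HP-UX interface.

```c
#include <assert.h>

/* Number of TLB entries needed to map data_bytes of address space when
   every page is page_bytes long (rounded up to whole pages). */
unsigned long long tlb_entries_needed(unsigned long long data_bytes,
                                      unsigned long long page_bytes)
{
    assert(page_bytes > 0);
    return (data_bytes + page_bytes - 1ULL) / page_bytes;
}
```

Under this model, a 64-Mbyte data set needs 16384 entries with 4-Kbyte pages but only 16 entries with 4-Mbyte pages.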

Using a different page size does not help if an application is not
experiencing performance degradation due to TLB misses. Additionally,
if an application uses too large a page size, fewer pages are available to
other applications on the system. This potentially results in increased
paging activity and performance degradation.

Valid page sizes on the PA-8200 processors are 4K, 16K, 64K, 256K,
1 Mbyte, 4 Mbytes, 16 Mbytes, 64 Mbytes, and 256 Mbytes. The default
configurable page size is 4K. Methods for specifying a page size are
described below. Note that the user-specified page size only requests a
specific size. The operating system takes various factors into account
when selecting the page size.

Specifying a page size 

The following chatr utility command options allow you to specify
information regarding page sizes.

• +pi affects the page size for the application's text segment

• +pd affects the page size for the application's data segment

The following configurable kernel parameters allow you to specify
information regarding page sizes.

• vps_pagesize represents the default or minimum page size (in
kilobytes) if the user has not used chatr to specify a value. The
default is 4 Kbytes.

• vps_ceiling represents the maximum page size (in kilobytes) if the
user has not used chatr to specify a value. The default is 16 Kbytes.

• vps_chatr_ceiling places a restriction on the largest value (in
kilobytes) a user can specify using chatr. The default is 64 Mbytes.

For more information on the chatr utility, see the chatr(1) man page.

Optimization levels

This chapter discusses various optimization levels available with the HP
compilers, including:

• HP optimization levels and features

• Using the Optimizer

The locations of the compilers discussed in this manual are provided in
Table 1.

Table 1 Locations of HP compilers

Compiler   Description   Location
f90        HP Fortran    /opt/fortran90/bin/f90
cc         ANSI C        /opt/ansic/bin/c89
aC++       ANSI C++      /opt/aCC/bin/aCC

For detailed information about optimization command-line options, and
pragmas and directives, see "Controlling optimization," on page 113.

HP optimization levels and features

This section provides an overview of optimization features that can be
selected through either the command-line optimization options or manual
specification using pragmas or directives.

Five optimization levels are available for use with the HP compilers: +O0
(the default), +O1, +O2, +O3, and +O4. These options have identical
names and perform identical optimizations, regardless of which compiler
you are using. They can also be specified on the compiler command line
in conjunction with other options you may want to use. HP compiler
optimization levels are described in Table 2.
Table 2   Optimization levels and features 

Level:     +O0 (the default) 
Features:  Occurs at the machine-instruction level 
           Constant folding 
           Data alignment on natural boundaries 
           Partial evaluation of test conditions 
           Registers (simple allocation) 
Benefits:  Compiles fastest. 

Level:     +O1 (includes all of +O0) 
Features:  Occurs at the block level 
           Branch optimization 
           Dead code elimination 
           Instruction scheduler 
           Peephole optimizations 
           Registers (faster allocation) 
Benefits:  Produces faster programs than +O0, and compiles faster than 
           level +O2. 

Level:     +O2 (-O) (includes all of +O0, +O1) 
Features:  Occurs at the routine level 
           Common subexpression elimination 
           Constant folding (advanced) and propagation 
           Loop-invariant code motion 
           Loop unrolling 
           Registers (global allocation) 
           Register reassociation 
           Software pipelining 
           Store/copy optimization 
           Strength reduction of induction variables and constants 
           Unused definition elimination 
Benefits:  Can produce faster run-time code than +O1 if loops are used 
           extensively. 
           Run-times for loop-oriented floating-point intensive 
           applications may be reduced up to 90 percent. 
           Operating system and interactive applications that use the 
           optimized system libraries may achieve 30 percent to 
           50 percent additional improvement. 

Level:     +O3 (includes all of +O0, +O1, +O2) 
Features:  Occurs at the file level 
           Cloning within a single source file 
           Data localization 
           Automatic and directive-specified loop parallelization 
           Directive-specified region parallelization 
           Directive-specified task parallelization 
           Inlining within a single source file 
           Loop blocking 
           Loop distribution 
           Loop fusion 
           Loop interchange 
           Loop reordering, preventing 
           Loop unroll and jam 
           Parallelization 
           Parallelization, preventing 
           Reductions 
           Test promotion 
Notes:     All of the directives and pragmas of the HP parallel 
           programming model are available in the Fortran and C 
           compilers. 
           prefer_parallel requests parallelization of the following 
           loop. 
           loop_parallel forces parallelization of the following loop. 
           parallel, end_parallel parallelizes a single code region to 
           run on multiple threads. 
           begin_tasks, next_task, end_tasks forces parallelization of 
           the following code sections. 
Benefits:  Can produce faster run-time code than +O2 on code that 
           frequently calls small functions, or if loops are 
           extensively used. Links faster than +O4. 

Level:     +O4 (includes all of +O0, +O1, +O2, +O3) 
           Not available in Fortran 
Features:  Occurs at the cross-module level and performed at link time 
           Cloning across multiple source files 
           Global/static variable optimizations 
           Inlining across multiple source files 
Benefits:  Produces faster run-time code than +O3 when global 
           variables are used or when procedure calls are inlined 
           across modules. 
Cumulative Options 

The optimization options that control an optimization level are 
cumulative so that each option retains the optimizations of the previous 
option. For example, entering the following command line compiles the 
Fortran program foo.f with all +O2, +O1, and +O0 optimizations shown in 
Table 2: 

% f90 +O2 foo.f 

In addition to these options, the +Oparallel option is available for use 
at +O3 and above; +Onoparallel is the default. When the +Oparallel 
option is specified, the compiler: 

• Looks for opportunities for parallel execution in loops 

• Honors the parallelism-related directives and pragmas of the HP 
parallel programming model. 

The +Onoautopar (no automatic parallelization) option is available for 
use with +Oparallel at +O3 and above. +Oautopar is the default. 
+Onoautopar causes the compiler to parallelize only those loops that are 
immediately preceded by loop_parallel or prefer_parallel 
directives or pragmas. For more information, refer to "Parallel 
programming techniques," on page 175. 
Using the Optimizer 

Before exploring the various optimizations that are performed, it is 
important to review the coding guidelines used to assist the optimizer. 

This section is broken down into the following subsections: 

• General guidelines 

• C and C++ guidelines 

• Fortran guidelines 

General guidelines 

The coding guidelines presented in this section help the optimizer to 
optimize your program, regardless of the language in which the program 
is written. 

• Use local variables to help the optimizer promote variables to 
registers. 

• Do not use local variables before they are initialized. When you 
request +O2, +O3, or +O4 optimizations, the compiler tries to detect 
and indicate violations of this rule. See "+O[no]initcheck" on 
page 123 for related information. 

• Use constants instead of variables in arithmetic expressions such as 
shift, multiplication, division, or remainder operations. 

• When a loop contains a procedure call, position the loop inside the 
procedure or use a directive to call the loop in parallel. 

• Construct loops so the induction variable increases or decreases 
toward zero where possible. The code generated for a loop termination 
test is more efficient with a test against zero than with a test against 
some other value. 

• Do not reference outside the bounds of an array. Fortran provides 
the -C option to check whether your program references outside array 
bounds. 

• Do not pass an incorrect number of arguments to a function. 
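
The loop-direction and local-variable guidelines can be sketched in C as 
follows. This is a minimal illustration, not compiler output: the function 
names sum_up and sum_down are hypothetical, and whether the count-down form 
is actually faster depends on the compiler and the target processor. 

```c
#include <stddef.h>

/* Counts upward: the termination test compares i against n. */
long sum_up(const int *a, size_t n)
{
    long total = 0;                 /* local accumulator, a register candidate */
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Counts downward toward zero: the termination test is a test against zero. */
long sum_down(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = n; i > 0; i--)
        total += a[i - 1];
    return total;
}
```

Both functions return the same sum; only the direction of the induction 
variable differs. 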
C and C++ guidelines 

The coding guidelines presented in this section help the optimizer to 
optimize your C and C++ programs. 

• Use do loops and for loops in place of while loops. do loops and for 
loops are more efficient because opportunities for removing 
loop-invariant code are greater. 

• Use register variables where possible. 

• Use unsigned variables rather than signed, when using short or 
char variables or bit-fields. This is more efficient because a signed 
variable causes an extra instruction to be generated. 

• Pass and return pointers to large structs instead of passing and 
returning large structs by value, where possible. 

• Use type-checking tools like lint to help eliminate semantic errors. 

• Use local variables for the upper bounds (stop values) of loops. Using 
local variables may enable the compiler to optimize the loop. 
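
The last guideline can be sketched as follows. The global g_count and the 
function scale_all are hypothetical names chosen for illustration; the 
point is only that the stop value is copied into a local before the loop 
starts. 

```c
#include <stddef.h>

size_t g_count = 0;   /* hypothetical global holding the loop's stop value */

void scale_all(double *data, double factor)
{
    /* Copy the global stop value into a local so the loop bound is a
       plain local variable for the whole loop. */
    const size_t n = g_count;
    for (size_t i = 0; i < n; i++)
        data[i] *= factor;
}
```

With the bound held in a local, the compiler does not need to analyze 
accesses to the global to treat the stop value as loop-invariant. 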

During optimization, the compiler gathers information about the use of 
variables and passes this information to the optimizer. The optimizer 
uses this information to ensure that every code transformation 
maintains the correctness of the program, at least to the extent that the 
original unoptimized program is correct. 

When gathering this information, the compiler assumes that while 
inside a function, the only variables that are accessed indirectly through 
a pointer or by another function call are: 

• Global variables (all variables with file scope) 

• Local variables that have had their addresses taken either explicitly 
by the & operator, or implicitly by the automatic conversion of array 
references to pointers. 
In general, the preceding assumption should not pose a problem. 
Standard-compliant C and C++ programs do not violate this assumption. 
However, if you have code that does violate this assumption, the 
optimizer can change the behavior of the program in an undesirable way. 
In particular, follow these coding practices to ensure correct 
program execution for optimized code: 

• Avoid using variables that are accessed by external processes. Unless 
a variable is declared with the volatile attribute, the compiler 
assumes that a program's data is accessed only by that program. 
Using the volatile attribute may significantly slow down a 
program. 

• Avoid accessing an array other than the one being subscripted. For 
example, the construct a[b-a], where a and b are the same type of 
array, actually references the array b, because it is equivalent to 
*(a+(b-a)), which is equivalent to *b. Using this construct might 
yield unexpected optimization results. 

• Avoid referencing outside the bounds of the objects a pointer is 
pointing to. All references of the form *(p+i) are assumed to remain 
within the bounds of the variable or variables that p was assigned to 
point to. 

• Do not rely on the memory layout scheme when manipulating 
pointers, as incorrect optimizations may result. For example, if p is 
pointing to the first member of a structure, do not assume that p+1 
points to the second member of the structure. Additionally, if p is 
pointing to the first in a list of declared variables, p+1 is not 
necessarily pointing to the second variable in the list. 
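
The memory-layout guideline can be illustrated with a short C sketch. The 
structure sample and the function second_member are hypothetical; the 
sketch shows the safe alternative to pointer arithmetic, which is to name 
the member you want. 

```c
#include <stddef.h>

struct sample {          /* hypothetical structure for illustration */
    char   tag;
    double value;
};

/* Returns the address of the second member by naming it. This stays
   correct whatever padding the compiler inserts after tag. */
double *second_member(struct sample *s)
{
    return &s->value;
}
```

Where a byte offset is genuinely needed, offsetof(struct sample, value) 
from <stddef.h> reports it without assuming anything about padding. 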

For more information regarding coding guidelines, see "General 
guidelines" on page 31. 
Fortran guidelines 

The coding guidelines presented in this section help the optimizer to 
optimize Fortran programs. 

As part of the optimization process, the compiler gathers information 
about the use of variables and passes this information to the optimizer. 
The optimizer uses this information to ensure that every code 
transformation maintains the correctness of the program, at least to the 
extent that the original unoptimized program is correct. 

When gathering this information, the compiler assumes that inside a 
routine (either a function or a subroutine) the only variables that are 
accessed (directly or indirectly) are: 

• Common variables declared in the routine 

• Local variables 

• Parameters to this routine 

Local variables include all static and nonstatic variables. 

In general, you do not need to be concerned about the preceding 
assumption. However, if you have code that violates it, the optimizer can 
adversely affect the behavior of the program. 

Avoid using variables that are accessed by a process other than the 
program. The compiler assumes that the program is the only process 
accessing its data. The only exception is the shared common variable in 
Fortran. 

Also avoid using extensive equivalencing and memory-mapping schemes, 
where possible. 

See "General guidelines" on page 31 for additional guidelines. 
This chapter discusses the standard optimization features available with 
the HP-UX compilers, including those inherent in optimization levels 
+O0 through +O2. This includes a discussion of the following topics: 

• Constant folding 

• Partial evaluation of test conditions 

• Simple register assignment 

• Data alignment on natural boundaries 

• Branch optimization 

• Dead code elimination 

• Faster register allocation 

• Instruction scheduling 

• Peephole optimizations 

• Advanced constant folding and propagation 

• Common subexpression elimination 

• Global register allocation (GRA) 

• Loop-invariant code motion, and unrolling 

• Register reassociation 

• Software pipelining 

• Strength reduction of induction variables and constants 

• Store and copy optimization 

• Unused definition elimination 

For more information as to specific command-line options, pragmas and 
directives for optimization, please see "Controlling optimization," on 
page 113. 
Machine instruction level optimizations (+O0) 

At optimization level +O0, the compiler performs optimizations that span 
only a single source statement. This is the default. The +O0 machine 
instruction level optimizations include: 

• Constant folding 

• Partial evaluation of test conditions 

• Simple register assignment 

• Data alignment on natural boundaries 

Constant folding 

Constant folding is the replacement of operations on constants with the 
result of the operation. For example, y = 5 + 7 is replaced with y = 12. 

More advanced constant folding is performed at optimization level +O2. 
See "Advanced constant folding and propagation" on page 42 for more 
information. 

Partial evaluation of test conditions 

Where possible, the compiler determines the truth value of a logical 
expression without evaluating all the operands. This is known as 
short-circuiting. The Fortran example below illustrates this: 

      IF ((I .EQ. J) .OR. (I .EQ. K)) GOTO 100 

If (I .EQ. J) is true, control immediately goes to 100; otherwise, 
(I .EQ. K) must be evaluated before control can go to 100 or the 
following statement. 

Do not rely upon partial evaluation if you use function calls in the logical 
expression because: 

• There is no guarantee on the order of evaluation. 

• A procedure or function call can have side effects on variable values 
that may or may not be partially evaluated correctly. 
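
The hazard described above can be sketched in C. Unlike Fortran, C defines 
left-to-right, short-circuit evaluation for the || operator, so the number 
of calls below is well defined; the sketch only shows how a side effect 
comes to depend on partial evaluation and how to remove that dependence. 
The names calls, check, either_fragile, and either_explicit are 
hypothetical. 

```c
int calls = 0;            /* counts how often check() runs */

/* Hypothetical function with a side effect: it increments calls. */
int check(int value)
{
    calls++;
    return value != 0;
}

/* Side effect depends on partial evaluation: check(b) runs only when
   check(a) returns false. */
int either_fragile(int a, int b)
{
    return check(a) || check(b);
}

/* Both calls are made before the logical test, so the side effects no
   longer depend on partial evaluation. */
int either_explicit(int a, int b)
{
    int first  = check(a);
    int second = check(b);
    return first || second;
}
```

Both functions return the same truth value, but only the second performs 
the same number of calls regardless of the operand values. 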
Simple register assignment 

The compiler may place frequently used variables in registers to avoid 
more costly accesses to memory. 

A more advanced register assignment algorithm is used at optimization 
level +O2. See "Global register allocation (GRA)" on page 43 for more 
information. 

Data alignment on natural boundaries 

The compiler automatically aligns data objects to their natural 
boundaries in memory, providing more efficient access to data. This 
means that a data object's address is integrally divisible by the length of 
its data type; for example, real*8 objects have addresses integrally 
divisible by 8 bytes. 

Aliases can inhibit data alignment. Be especially careful when equivalencing 
arrays in Fortran. 

Declare scalar variables in order from longest to shortest data length to 
ensure the efficient layout of such aligned data in memory. This 
minimizes the amount of padding the compiler has to do to get the data 
onto its natural boundary. 

Example: Data alignment on natural boundaries 

The following Fortran example describes the alignment of data objects to 
their natural boundaries: 

C     CAUTION: POORLY ORDERED DATA FOLLOWS: 
      LOGICAL*2 BOOL 
      INTEGER*8 A, B 
      REAL*4 C 
      REAL*8 D 

Here, the compiler must insert 6 unused bytes after BOOL in order to 
correctly align A, and it must insert 4 unused bytes after C to correctly 
align D. 
The same data is more efficiently ordered as shown in the following 
example: 

C     PROPERLY ORDERED DATA FOLLOWS: 
      INTEGER*8 A, B 
      REAL*8 D 
      REAL*4 C 
      LOGICAL*2 BOOL 

Natural boundary alignment is performed on all data. This is not to be 
confused with cache line boundary alignment. 
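
The same padding effect can be observed in C with sizeof. The sketch below 
mirrors the two Fortran orderings using fixed-width types; the structure 
and function names are hypothetical, and the exact sizes depend on the 
platform ABI. On an ABI that aligns 8-byte types on 8-byte boundaries, the 
first layout typically occupies 40 bytes and the second 32. 

```c
#include <stddef.h>
#include <stdint.h>

/* Same members as the poorly ordered declarations. */
struct poorly_ordered {
    int16_t flag;      /* LOGICAL*2 */
    int64_t a, b;      /* INTEGER*8 */
    float   c;         /* REAL*4    */
    double  d;         /* REAL*8    */
};

/* Members sorted from longest to shortest data length. */
struct well_ordered {
    int64_t a, b;
    double  d;
    float   c;
    int16_t flag;
};

/* Number of padding bytes saved by reordering (may be 0 on some ABIs). */
size_t padding_saved(void)
{
    return sizeof(struct poorly_ordered) - sizeof(struct well_ordered);
}
```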
Block level optimizations (+O1) 

At optimization level +O1, the compiler performs optimizations on a 
block level. The compiler continues to run the +O0 optimizations, with 
the following additions: 

• Branch optimization 

• Dead code elimination 

• Faster register allocation 

• Instruction scheduling 

• Peephole optimizations 

Branch optimization 

Branch optimization involves traversing the procedure and transforming 
branch instruction sequences into more efficient sequences where 
possible. Examples of possible transformations are: 

• Deleting branches whose target is the fall-through instruction (the 
target is two instructions away) 

• When the target of a branch is an unconditional branch, changing the 
target of the first branch to be the target of the second 
(unconditional) branch 

• Transforming an unconditional branch at the bottom of a loop that 
branches to a conditional branch at the top of the loop into a 
conditional branch at the bottom of the loop 

• Changing an unconditional branch to the exit of a procedure into an 
exit sequence where possible 

• Changing conditional or unconditional branch instructions that 
branch over a single instruction into a conditional nullification in the 
following instruction 

• Looking for conditional branches over unconditional branches, where 
the sense of the first branch could be inverted and the second branch 
deleted. These result from null then clauses and from then clauses 
that only contain goto statements. 
Example: Conditional/unconditional branches 

The following Fortran example provides a transformation from a branch 
instruction to a more efficient sequence: 

      IF (L) THEN 
        A=A*2 
      ELSE 
        GOTO 100 
      ENDIF 
      B=A+1 
100   C=A*10 

becomes: 

      IF (.NOT. L) GOTO 100 
      A=A*2 
      B=A+1 
100   C=A*10 

Dead code elimination 

Dead code elimination removes unreachable code that is never executed. 
For example, in C: 

if (0) 
    a = 1; 
else 
    a = 2; 

becomes: 

a = 2; 

Faster register allocation 

Faster register allocation involves: 

• Inserting entry and exit code 

• Generating code for operations such as multiplication and division 

• Eliminating unnecessary copy instructions 

• Allocating actual registers to the dummy registers in instructions 

Faster register allocation, when used at +O0 or +O1, analyzes register 
use faster than the global register allocation performed at +O2. 
Instruction scheduling 

The instruction scheduler optimization performs the following tasks: 

• Reorders the instructions in a basic block to improve memory 
pipelining. For example, where possible, a load instruction is 
separated from the use of the loaded register. 

• Follows a branch instruction with an instruction that is executed as 
the branch occurs, where possible. 

• Schedules floating-point instructions. 

Peephole optimizations 

A peephole optimization is a machine-dependent optimization that 
makes a pass through low-level assembly-like instruction sequences of 
the program. It applies patterns to a small window (peephole) of code 
looking for optimization opportunities. It performs the following 
optimizations: 

• Changes the addressing mode of instructions so they use shorter 
sequences 

• Replaces low-level assembly-like instruction sequences with faster 
(usually shorter) sequences and removes redundant register loads 
and stores 
Routine level optimizations (+O2) 

At optimization level +O2, the compiler performs optimizations on a 
routine level. The compiler continues to perform the optimizations 
performed at +O1, with the following additions: 

• Advanced constant folding and propagation 

• Common subexpression elimination 

• Global register allocation (GRA) 

• Loop-invariant code motion 

• Loop unrolling 

• Register reassociation 

• Software pipelining 

• Strength reduction of induction variables and constants 

• Store and copy optimization 

• Unused definition elimination 

Advanced constant folding and propagation 

Constant folding computes the value of a constant expression at compile 
time. Constant propagation is the automatic compile-time replacement of 
variable references with a constant value previously assigned to that 
variable. 

Example: Advanced constant folding and propagation 

The following C/C++ code example illustrates advanced constant 
folding and propagation: 

a = 10; 
b = a + 5; 
c = 4 * b; 

Once a is assigned, its value is propagated to the statement where b is 
assigned so that the assignment reads: 

b = 10 + 5; 

The expression 10 + 5 can then be folded. Now that b has been assigned 
a constant, the value of b is propagated to the statement where c is 
assigned. After all the folding and propagation, the original code is 
replaced by: 

a = 10; 
b = 15; 
c = 60; 

Common subexpression elimination 

Common subexpression elimination optimization identifies expressions 
that appear more than once and have the same result. It then computes 
the result and substitutes the result for each occurrence of the 
expression. Subexpression types include instructions that load values 
from memory, as well as arithmetic evaluation. 

Example: Common subexpression elimination 

In Fortran, for example, the code first looks like this: 

      A = X + Y + Z 
      B = X + Y + W 

After this form of optimization, it becomes: 

      T1 = X + Y 
      A = T1 + Z 
      B = T1 + W 

Global register allocation (GRA) 

Scalar variables can often be stored in registers, eliminating the need for 
costly memory accesses. Global register allocation (GRA) attempts to 
store commonly referenced scalar variables in registers throughout the 
code in which they are most frequently accessed. 

The compiler automatically determines which scalar variables are the 
best candidates for GRA and allocates registers accordingly. 

GRA can sometimes cause problems when parallel threads attempt to 
update a shared variable that has been allocated a register. In this case, 
each parallel thread allocates a register for the shared variable; it is then 
unlikely that the copy in memory is updated correctly as each thread 
executes. 
Parallel assignments to the same shared variables from multiple threads 
make sense only if the assignments are contained inside critical or 
ordered sections, or are executed conditionally based on the thread ID. 
GRA does not allocate registers for shared variables that are assigned 
within critical or ordered sections, as long as the sections are 
implemented using compiler directives or sync_routine-defined 
functions (refer to "Parallel synchronization," on page 243 for a 
discussion of sync_routine). However, for conditional assignments 
based on the thread ID, GRA may allocate registers that may cause 
wrong answers when stored. 

In such cases, you can disable GRA only for shared variables that are 
visible to multiple threads by specifying +Onosharedgra. A description of 
this option is located in "+O[no]sharedgra" on page 138. 

In procedures with large numbers of loops, GRA can contribute to long 
compile times. Therefore, GRA is only performed if the number of loops 
in the procedure is below a predetermined limit. You can remove this 
limit (and possibly increase compile time) by specifying +O[no]limit. A 
description of this option is located in "+O[no]limit" on page 126. 

This optimization is also known as coloring register allocation because of 
the similarity to map-coloring algorithms in graph theory. 

Register allocation in C and C++ 

In C and C++, you can help the optimizer understand when certain 
variables are heavily used within a function by declaring these variables 
with the register qualifier. 

GRA may override your choices and promote a variable not declared 
register to a register over a variable that is declared register, based 
on estimated speed improvements. 


Loop-invariant code motion 

The loop-invariant code motion optimization recognizes instructions 
inside a loop whose results do not change and then moves the 
instructions outside the loop. This optimization ensures that the 
invariant code is only executed once. 

Example: Loop-invariant code motion 

This example begins with the following C/C++ code: 

x = z; 
for (i=0; i<10; i++) 
    a[i] = 4 * x + i; 

After loop-invariant code motion, it becomes: 

x = z; 
t1 = 4 * x; 
for (i=0; i<10; i++) 
    a[i] = t1 + i; 

Loop unrolling 

Loop unrolling increases a loop's step value and replicates the loop body. 
Each replication is appropriately offset from the induction variable so 
that all iterations are performed, given the new step. 

Unrolling can be total or partial. Total unrolling involves eliminating 
the loop structure completely by replicating the loop body a number of 
times equal to the iteration count and replacing the iteration variable 
with constants. This makes sense only for loops with small iteration 
counts. 

Loop unrolling and the unroll factor are controlled using the 
+O[no]loop_unroll[=unroll factor] option. This option is described on 
page 127. 

Some loop transformations cause loops to be fully or partially replicated. 
Because unlimited loop replication can significantly increase compile 
times, loop replication is limited by default. You can increase this limit 
(and possibly increase your program's compile time and code size) by 
specifying the +Onosize and +Onolimit compiler options. 

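Example: Loop unrolling 

Consider the following Fortran example: 

      SUBROUTINE FOO(A,B) 
      REAL A(10,10), B(10,10) 
      DO J=1, 4 
        DO I=1, 4 
          A(I,J) = B(I,J) 
        ENDDO 
      ENDDO 
      END 

The loop nest is completely unrolled as shown below: 

      A(1,1) = B(1,1) 
      A(2,1) = B(2,1) 
      A(3,1) = B(3,1) 
      A(4,1) = B(4,1) 
      A(1,2) = B(1,2) 
      A(2,2) = B(2,2) 
      A(3,2) = B(3,2) 
      A(4,2) = B(4,2) 
      A(1,3) = B(1,3) 
      A(2,3) = B(2,3) 
      A(3,3) = B(3,3) 
      A(4,3) = B(4,3) 
      A(1,4) = B(1,4) 
      A(2,4) = B(2,4) 
      A(3,4) = B(3,4) 
      A(4,4) = B(4,4) 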
Partial unrolling is performed on loops with larger or unknown iteration 
counts. This form of unrolling retains the loop structure, but replicates 
the body a number of times equal to the unroll factor and adjusts 
references to the iteration variable accordingly. 

Example: Loop unrolling 

This example begins with the following Fortran loop: 

      DO I = 1, 100 
        A(I) = B(I) + C(I) 
      ENDDO 

It is unrolled to a depth of four as shown below: 

      DO I = 1, 100, 4 
        A(I) = B(I) + C(I) 
        A(I+1) = B(I+1) + C(I+1) 
        A(I+2) = B(I+2) + C(I+2) 
        A(I+3) = B(I+3) + C(I+3) 
      ENDDO 

Each iteration of the loop now computes four values of A instead of one 
value. The compiler also generates 'clean-up' code for the case where the 
range is not evenly divisible by the unroll factor. 
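
The role of the clean-up code can be sketched by unrolling the same loop 
by hand in C. This is an illustration only, not the code the compiler 
emits; the function name add_unrolled and the variable limit are 
hypothetical. 

```c
#include <stddef.h>

/* Hand-unrolled by four, with a clean-up loop for the remainder. */
void add_unrolled(double *a, const double *b, const double *c, size_t n)
{
    size_t i = 0;
    size_t limit = n - (n % 4);        /* largest multiple of 4 that is <= n */

    for (; i < limit; i += 4) {        /* unrolled body: four elements per trip */
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }
    for (; i < n; i++)                 /* clean-up: at most three elements */
        a[i] = b[i] + c[i];
}
```

When n is a multiple of four, as in the 100-iteration loop, the clean-up 
loop executes zero times. 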
Register reassociation 

Array references often require one or more instructions to compute the 
virtual memory address of the array element specified by the subscript 
expression. The register reassociation optimization implemented in 
PA-RISC compilers tries to reduce the cost of computing the virtual 
memory address expression for array references found in loops. 

Within loops, the virtual memory address expression is rearranged and 
separated into a loop-variant term and a loop-invariant term. 

• Loop-variant terms are those items whose values may change from 
one iteration of the loop to another. The loop-variant term 
corresponds to the difference in the virtual memory address 
associated with a particular array reference from one iteration of the 
loop to the next. 

• Loop-invariant terms are those items whose values are constant 
throughout all iterations of the loop. 

The register reassociation optimization dedicates a register to track the 
value of the virtual memory address expression for one or more array 
references in a loop and updates the register appropriately in each 
iteration of a loop. 

The register is initialized outside the loop to the loop-invariant portion of 
the virtual memory address expression. The register is incremented or 
decremented within the loop by the loop-variant portion of the virtual 
memory address expression. The net result is that array references in 
loops are converted into equivalent, but more efficient, pointer 
dereferences. 

Register reassociation can often enable another loop optimization. After 
performing the register reassociation optimization, the loop variable may 
be needed only to control the iteration count of the loop. If this is the 
case, the original loop variable is eliminated altogether by using the 
PA-RISC addib and addb machine instructions to control the loop 
iteration count. 

You can enable or disable register reassociation using the 
+O[no]regreassoc command-line option at +O2 and above. The default 
is +Oregreassoc. See "+O[no]regreassoc" on page 136 for more 
information. 
Example: Register reassociation 

This example begins with the following C/C++ code: 

int a[10][20][30]; 

void example (void) 
{ 
    int i, j, k; 

    for (k = 0; k < 10; k++) 
        for (j = 0; j < 10; j++) 
            for (i = 0; i < 10; i++) 
                a[i][j][k] = 1; 
} 

After register reassociation is applied, the innermost loop becomes: 

int a[10][20][30]; 

void example (void) 
{ 
    int i, j, k; 
    register int (*p)[20][30]; 

    for (k = 0; k < 10; k++) 
        for (j = 0; j < 10; j++) 
            for (p = (int (*)[20][30]) &a[0][j][k], i = 0; i < 10; i++) 
                *(p++[0][0]) = 1; 
} 


As you can see, the compiler-generated temporary register variable, p, 
strides through the array a in the innermost loop. This register pointer 
variable is initialized outside the innermost loop and auto-incremented 
within the innermost loop as a side-effect of the pointer dereference. 
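
The equivalence between the indexed form and the pointer form can be 
checked with a smaller, self-contained C sketch. It assumes a row-major 
array stored in a flat buffer of rows * cols elements with k < cols; the 
function names are hypothetical and the code is a hand-written analogy, not 
compiler output. 

```c
#include <stddef.h>

/* Index form: every store recomputes the offset i * cols + k. */
void fill_column_indexed(int *base, size_t rows, size_t cols, size_t k, int value)
{
    for (size_t i = 0; i < rows; i++)
        base[i * cols + k] = value;
}

/* Pointer form: the invariant part (base + k) is computed once and the
   pointer is advanced by the constant stride cols on each iteration. */
void fill_column_strided(int *base, size_t rows, size_t cols, size_t k, int value)
{
    if (rows == 0)
        return;

    int *p = base + k;                     /* loop-invariant part */
    for (size_t left = rows; left > 0; left--) {
        *p = value;
        if (left > 1)
            p += cols;                     /* loop-variant part: constant stride */
    }
}
```

In the pointer form the counter left is needed only to control the 
iteration count, which is the situation in which the original loop 
variable can be eliminated. 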
Software pipelining 

Software pipelining transforms code in order to optimize program loops. 
It achieves this by rearranging the order in which instructions are 
executed in a loop. Software pipelining generates code that overlaps 
operations from different loop iterations. It is particularly useful for 
loops that contain arithmetic operations on real*4 and real*8 data in 
Fortran or on float and double data in C or C++. 

The goal of this optimization is to avoid processor stalls due to memory 
or hardware pipeline latencies. The software pipelining transformation 
partially unrolls a loop and adds code before and after the loop to achieve 
a high degree of optimization within the loop. 

You can enable or disable software pipelining using the 
+O[no]pipeline command-line option at +O2 and above. The default is 
+Opipeline. Use +Onopipeline if a smaller program size and faster 
compile time are more important than faster execution speed. See 
"+O[no]pipeline" on page 130 for more information. 

Prerequisites of pipelining 

Software pipelining is attempted on a loop that meets the following 
criteria: 

• It is the innermost loop 

• There are no branches or function calls within the loop 

• The loop is of moderate size 

This optimization produces slightly larger program files and increases 
compile time. It is most beneficial in programs containing loops that are 
executed many times. 

Example: Software pipelining 

The following C/C++ example shows a loop before and after the software 
pipelining optimization: 

#define SIZ 10000 
float x[SIZ], y[SIZ]; 
int i; 
init(); 

for (i = 0; i < SIZ; i++) 
    x[i] = x[i] / y[i] + 4.00; 
Four significant things happen in this example: 

• A portion of the first iteration of the loop is performed before the loop. 

• A portion of the last iteration of the loop is performed after the loop. 

• The loop is unrolled twice. 

• Operations from different loop iterations are interleaved with 
each other. 

When this loop is compiled with software pipelining, the optimization is 
expressed as follows: 


R1 = 0;                Initialize array index 
R2 = 4.00;             Load constant value 
R3 = X[0];             Load first X value 
R4 = Y[0];             Load first Y value 
R5 = R3 / R4;          Perform division on first element: 
                       n = X[0]/Y[0] 
do {                   Begin loop 
    R6 = R1;           Save current array index 
    R1++;              Increment array index 
    R7 = X[R1];        Load current X value 
    R8 = Y[R1];        Load current Y value 
    R9 = R5 + R2;      Perform addition on prior row: 
                       X[i] = n + 4.00 
    R10 = R7 / R8;     Perform division on current row: 
                       m = X[i+1]/Y[i+1] 
    X[R6] = R9;        Save result of operations on prior row 
    R6 = R1;           Save current array index 
    R1++;              Increment array index 
    R3 = X[R1];        Load next X value 
    R4 = Y[R1];        Load next Y value 
    R11 = R10 + R2;    Perform addition on current row: 
                       X[i+1] = m + 4.00 
    R5 = R3 / R4;      Perform division on next row: 
                       n = X[i+2]/Y[i+2] 
    X[R6] = R11;       Save result of operations on current row 
} while (R1 <= 100);   End loop 
R9 = R5 + R2;          Perform addition on last row: 
                       X[i+2] = n + 4.00 
X[R6] = R9;            Save result of operations on last row 

This transformation stores intermediate results of the division
instructions in unique registers (noted as n and m). These registers are
not referenced until several instructions after the division operations.
This decreases the possibility that the long latency period of the division
instructions will stall the instruction pipeline and cause processing
delays.
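
The same idea can be sketched at the source level. The following is a minimal illustration, not the code the compiler actually generates: the function names are hypothetical, the loop is not unrolled, and the only point shown is that the division for the next element is started before the addition and store for the current element are completed.

```c
#include <assert.h>
#include <stddef.h>

/* Plain form of the loop: x[i] = x[i] / y[i] + 4.00 for each element. */
void update_plain(float *x, const float *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] / y[i] + 4.00f;
}

/* Hand-pipelined form: the division for element i+1 is issued before the
   addition and store for element i, so the two can overlap. */
void update_pipelined(float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    if (n == 0)
        return;
    float q = x[0] / y[0];                    /* prologue: first division */
    for (size_t i = 0; i + 1 < n; i++) {
        float next_q = x[i + 1] / y[i + 1];   /* divide for the next element */
        x[i] = q + 4.00f;                     /* add and store the prior one */
        q = next_q;
    }
    x[n - 1] = q + 4.00f;                     /* epilogue: last add and store */
}

/* Returns 1 when both forms agree on a small sample with exact results. */
int pipelined_matches_plain(void)
{
    float a[5] = {8, 6, 4, 2, 10};
    float b[5] = {8, 6, 4, 2, 10};
    const float y[5] = {2, 2, 2, 2, 5};
    update_plain(a, y, 5);
    update_pipelined(b, y, 5);
    for (size_t i = 0; i < 5; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Calling pipelined_matches_plain() compares the two forms on a five-element sample; the prologue and epilogue correspond to the work done before and after the loop in the register listing.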


Strength reduction of induction variables 
and constants 

This optimization removes expressions that are linear functions of a loop
counter and replaces each of them with a variable that contains the
value of the function. Variables of the same linear function are computed
only once. This optimization also replaces multiplication instructions
with addition instructions wherever possible.

Strength reduction of induction variables and constants 

This example begins with the following C/C++ code:

for (i=0; i<25; i++) {
    r[i] = i * k;
}

After this optimization, it looks like this:

t1 = 0;
for (i=0; i<25; i++) {
    r[i] = t1;
    t1 += k;
}
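
As a quick check that the two forms are interchangeable, the sketch below wraps each one in a function and compares the results. The function names and the COUNT macro are hypothetical; the range assertion is an added assumption that keeps the running sum from overflowing.

```c
#include <assert.h>
#include <limits.h>

#define COUNT 25

/* Original form: one multiplication per iteration. */
void fill_multiply(int r[COUNT], int k)
{
    for (int i = 0; i < COUNT; i++)
        r[i] = i * k;
}

/* Strength-reduced form: t1 tracks i * k and is advanced by addition. */
void fill_add(int r[COUNT], int k)
{
    /* t1 is advanced once past the last stored value, so COUNT * k
       must stay within the range of int. */
    assert(k <= INT_MAX / COUNT && k >= INT_MIN / COUNT);
    int t1 = 0;
    for (int i = 0; i < COUNT; i++) {
        r[i] = t1;
        t1 += k;
    }
}

/* Returns 1 when both forms produce the same array for the given k. */
int same_result(int k)
{
    int a[COUNT], b[COUNT];
    fill_multiply(a, k);
    fill_add(b, k);
    for (int i = 0; i < COUNT; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```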
Store and copy optimization

Where possible, the store and copy optimization substitutes registers for 
memory locations, by replacing store instructions with copy instructions 
and deleting load instructions. 

Unused definition elimination 

The unused definition elimination optimization removes unused memory 
location and register definitions. These definitions are often a result of 
transformations made by other optimizations. 

Unused definition elimination 

This example begins with the following C/C++ code:

f(int x){ 
int a,b,c; 

a = 1; 
b = 2; 
c = x * b; 
return c; 

} 

After unused definition elimination, it looks like this:

f(int x) { 
int a,b,c; 

c = x * 2; 
return c; 

} 

The assignment a = 1 is removed because a is not used after it is
defined. Due to another +O2 optimization (constant propagation), the
c = x * b statement becomes c = x * 2. The assignment b = 2 is
then removed as well.
This chapter discusses loop optimization features available with the
HP-UX compilers, including those inherent in optimization level +O3.
This includes a discussion of the following topics: 

• Strip mining 

• Inlining within a single source file

• Cloning within a single source file 

• Data localization 

• Loop blocking 

• Loop distribution 

• Loop fusion 

• Loop interchange 

• Loop unroll and jam 

• Preventing loop reordering

• Test promotion 

• Cross-module cloning 

For more information as to specific loop optimization command-line 
options, as well as related pragmas and directives for optimization, 
please see "Controlling optimization," on page 113.
Loop and cross-module optimization features

Strip mining

Strip mining is a fundamental +O3 transformation. Used by itself,
strip mining is not profitable. However, it is used by loop blocking, 
loop unroll and jam, and, in a sense, by parallelization. 

Strip mining involves splitting a single loop into a nested loop. The 
resulting inner loop iterates over a section or strip of the original loop, 
and the new outer loop runs the inner loop enough times to cover all the 
strips, achieving the necessary total number of iterations. The number of 
iterations of the inner loop is known as the loop’s strip length. 

Strip mining 

This example begins with the Fortran code below: 

DO I = 1, 10000 

A(I) = A (I) * B (I) 

ENDDO 

Strip mining this loop using a strip length of 1000 yields the following 
loop nest: 

DO IOUTER = 1, 10000, 1000
  DO ISTRIP = IOUTER, IOUTER+999
    A(ISTRIP) = A(ISTRIP) * B(ISTRIP)
  ENDDO
ENDDO

In this loop, the strip length integrally divides the number of iterations,
so the loop is evenly split up. If the iteration count were not an integral
multiple of the strip length (if I went from 1 to 10500 rather than 1 to
10000, for example), the final execution of the strip loop would run
500 iterations instead of 1000.
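
The handling of a short final strip can be sketched in C as follows. This is an illustration under stated assumptions rather than the compiler's internal form: the function names are hypothetical, indexing is zero-based, and the inner loop bound is clamped explicitly so that the last strip stops at the end of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Strip-mined form of: for (i = 0; i < n; i++) a[i] *= b[i];
   The inner bound is clamped so a short final strip is handled. */
void multiply_strips(double *a, const double *b, size_t n, size_t strip)
{
    assert(strip > 0);
    for (size_t outer = 0; outer < n; outer += strip) {
        size_t end = (n - outer < strip) ? n : outer + strip;
        for (size_t i = outer; i < end; i++)
            a[i] *= b[i];
    }
}

/* Number of iterations executed by the final strip. */
size_t last_strip_length(size_t n, size_t strip)
{
    assert(strip > 0);
    if (n == 0)
        return 0;
    size_t rem = n % strip;
    return rem ? rem : strip;
}

/* Returns 1 if the strip-mined loop matches the plain loop (n <= 64). */
int strips_match_plain(size_t n, size_t strip)
{
    double a[64], p[64], b[64];
    assert(n <= 64);
    for (size_t i = 0; i < n; i++) {
        a[i] = p[i] = (double)i;
        b[i] = 2.0;
    }
    for (size_t i = 0; i < n; i++)
        p[i] *= b[i];
    multiply_strips(a, b, n, strip);
    for (size_t i = 0; i < n; i++)
        if (a[i] != p[i])
            return 0;
    return 1;
}
```

With 10500 iterations and a strip length of 1000, last_strip_length reports a final strip of 500 iterations, matching the case described above.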
Inlining within a single source file

Inlining substitutes selected function calls with copies of the function's
object code. Only functions that meet the optimizer's criteria are inlined.

Inlining may result in slightly larger executable files. However, this
increase in size is offset by the elimination of time-consuming procedure
calls and procedure returns.

At +O3, inlining is performed within a file; at +O4, it is performed across
files. Inlining is affected by the +O[no]inline[=namelist] and
+Oinline_budget=n command-line options. See "Controlling
optimization," on page 113 for more information.

Inlining within single source file 

The following is an example of inlining at the source code level. Before
inlining, the C source file looks like this:

/* Return the greatest common divisor of two positive integers, */
/* int1 and int2, computed using Euclid's algorithm. (Return 0  */
/* if either is not positive.)                                  */

int gcd(int int1, int int2)
{
    int inttemp;

    if ((int1 <= 0) || (int2 <= 0)) {
        return(0);
    }

    do {
        if (int1 < int2) {
            inttemp = int1;
            int1 = int2;
            int2 = inttemp;
        }
        int1 = int1 - int2;
    } while (int1 > 0);
    return(int2);
}

main()
{
    int xval, yval, gcdxy;

    /* statements before call to gcd */
    gcdxy = gcd(xval, yval);
    /* statements after call to gcd */
}
After inlining, main looks like this:

main()
{
    int xval, yval, gcdxy;

    /* statements before inlined version of gcd */
    {
        int int1;
        int int2;

        int1 = xval;
        int2 = yval;
        {
            int inttemp;

            if ((int1 <= 0) || (int2 <= 0)) {
                gcdxy = (0);
                goto AA003;
            }

            do {
                if (int1 < int2) {
                    inttemp = int1;
                    int1 = int2;
                    int2 = inttemp;
                }
                int1 = int1 - int2;
            } while (int1 > 0);
            gcdxy = (int2);
        }
    }
AA003: ;
    /* statements after inlined version of gcd */
}
Cloning within a single source file

Cloning replaces a call to a routine by calling a clone of that routine. The 
clone is optimized differently than the original routine. 

Cloning can expose additional opportunities for interprocedural 
optimization. At +O3, cloning is performed within a file, and at +O4,
cloning is performed across files. Cloning is enabled by default, and is
disabled by specifying the +Onoinline command-line option.
Data localization

Data localization occurs as a result of various loop transformations that 
occur at optimization levels +O2 or +O3. Because optimizations are
cumulative, specifying +O3 or +O4 takes advantage of the
transformations that happen at +O2.


Table 3   Loop transformations affecting data localization

Loop transformation     Options required for behavior to occur
-------------------     --------------------------------------
Loop unrolling          +O2 +Oloop_unroll
                        (+Oloop_unroll is on by default at +O2 and above)

Loop distribution       +O3 +Oloop_transform
                        (+Oloop_transform is on by default at +O3 and above)

Loop interchange        +O3 +Oloop_transform
                        (+Oloop_transform is on by default at +O3 and above)

Loop blocking           +O3 +Oloop_transform +Oloop_block
                        (+Oloop_transform is on by default at +O3 and above)
                        (+Oloop_block is off by default)

Loop fusion             +O3 +Oloop_transform
                        (+Oloop_transform is on by default at +O3 and above)

Loop unroll and jam     +O3 +Oloop_transform +Oloop_unroll_jam
                        (+Oloop_transform is on by default at +O3 and above)
                        (+Oloop_unroll_jam is off by default at +O3 and above)


Data localization keeps frequently used data in the processor data cache, 
eliminating the need for more costly memory accesses. 


Loops that manipulate arrays are the main candidates for localization 
optimizations. Most of these loops are eligible for the various 
transformations that the compiler performs at +O3. These
transformations are explained in detail in this section.

Some loop transformations cause loops to be fully or partially replicated. 
Because unlimited loop replication can significantly increase compile 
times, loop replication is limited by default. You can increase this limit
(and possibly increase your program's compile time and code size) by
specifying the +Onosize and +Onolimit compiler options.
NOTE


Most of the following code examples demonstrate optimization by showing 
the original code first and optimized code second. The optimized code is 
shown in the same language as the original code for illustrative purposes 
only. 


Conditions that inhibit data localization 

Any of the following conditions can inhibit or prevent data localization:

• Loop-carried dependences (LCDs) 

• Other loop fusion dependences 

• Aliasing 

• Computed or assigned GOTO statements in Fortran 

• return or exit statements in C or C++ 

• throw statements in C++ 

• Procedure calls 

The following sections discuss these conditions and their effects on data 
localization. 

Loop-carried dependences (LCDs) 

A loop-carried dependence (LCD) exists when one iteration of a loop 
assigns a value to an address that is referenced or assigned on another 
iteration. In some cases, LCDs can inhibit loop interchange, thereby 
inhibiting localization. Typically, these cases involve array indexes that
are offset in opposite directions. 
To ignore LCDs, use the no_loop_dependence directive or pragma. The
form of this directive and pragma is shown in Table 4.

This directive and pragma should only be used if you are certain that there
are no loop dependences. Otherwise, errors will result.


Table 4   Form of no_loop_dependence directive and pragma

Language    Form
--------    ----
Fortran     C$DIR NO_LOOP_DEPENDENCE(namelist)
C           #pragma _CNX no_loop_dependence(namelist)


where

namelist    is a comma-separated list of variables or arrays that
            have no dependences for the immediately following loop.

Loop-carried dependences 

The Fortran loop below contains an LCD that inhibits interchange: 

DO I = 2, M
  DO J = 2, N
    A(I,J) = A(I-1,J-1) + A(I-1,J+1)
  ENDDO
ENDDO

C and C++ loops can contain similar constructs, but to simplify
illustration, only the Fortran example is discussed here.

As written, this loop uses a(i-1,j-1) and a(i-1,j+1) to compute
a(i,j). Table 5 shows the sequence in which values of a are computed
for this loop.
Table 5   Computation sequence of a(i,j): original loop

I    J    A(I,J)    A(I-1,J-1)    A(I-1,J+1)
-    -    ------    ----------    ----------
2    2    A(2,2)    A(1,1)        A(1,3)
2    3    A(2,3)    A(1,2)        A(1,4)
2    4    A(2,4)    A(1,3)        A(1,5)
3    2    A(3,2)    A(2,1)        A(2,3)
3    3    A(3,3)    A(2,2)        A(2,4)
3    4    A(3,4)    A(2,3)        A(2,5)

As shown in Table 5, the original loop computes the elements of the
current row of a using the elements of the previous row of a. For all rows
except the first (which is never written), the values contained in the
previous row must be written before the current row is computed. This
dependence must be honored for the loop to yield its intended results. If a
row element of a is computed before the previous row elements are
computed, the result is incorrect.

Interchanging the i and j loops yields the following code:

DO J = 2, N
  DO I = 2, M
    A(I,J) = A(I-1,J+1) + A(I-1,J-1)
  ENDDO
ENDDO

After interchange, the loop computes values of a in the sequence shown 
in Table 6. 
Table 6   Computation sequence of a(i,j): interchanged loop

I    J    A(I,J)    A(I-1,J-1)    A(I-1,J+1)
-    -    ------    ----------    ----------
2    2    A(2,2)    A(1,1)        A(1,3)
3    2    A(3,2)    A(2,1)        A(2,3)
4    2    A(4,2)    A(3,1)        A(3,3)
2    3    A(2,3)    A(1,2)        A(1,4)
3    3    A(3,3)    A(2,2)        A(2,4)
4    3    A(4,3)    A(3,2)        A(3,4)


Here, the elements of the current column of a are computed using the 
elements of the previous column and the next column of a. 

The problem here is that columns of a are being computed using 
elements from the next column, which have not been written yet. This 
computation violates the dependence illustrated in Table 5. 
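
The effect can be reproduced with a small C sketch. It is only an illustration: the names, the 4 x 4 problem size, and the all-ones initial contents are assumptions, and the arrays use one-based subscripts purely to mirror the Fortran loop (storage order does not matter for this comparison).

```c
#include <assert.h>

#define M 4
#define N 4

/* 1-based indexing to mirror the Fortran loop; row 0 and column 0 are
   unused, and column N+1 is only read. */
typedef int grid[M + 1][N + 2];

static void fill_ones(grid a)
{
    for (int i = 0; i <= M; i++)
        for (int j = 0; j <= N + 1; j++)
            a[i][j] = 1;
}

/* Original order: i outer, j inner. */
void original_order(grid a)
{
    for (int i = 2; i <= M; i++)
        for (int j = 2; j <= N; j++)
            a[i][j] = a[i - 1][j - 1] + a[i - 1][j + 1];
}

/* Interchanged order: j outer, i inner. */
void interchanged_order(grid a)
{
    for (int j = 2; j <= N; j++)
        for (int i = 2; i <= M; i++)
            a[i][j] = a[i - 1][j - 1] + a[i - 1][j + 1];
}

/* Returns 1 if the two orders leave different values in the array. */
int interchange_changes_result(void)
{
    grid a, b;
    fill_ones(a);
    fill_ones(b);
    original_order(a);
    interchanged_order(b);
    /* Hand-computed: original order gives a[3][2] = a[2][1] + a[2][3] = 1 + 2. */
    assert(a[3][2] == 3);
    for (int i = 2; i <= M; i++)
        for (int j = 2; j <= N; j++)
            if (a[i][j] != b[i][j])
                return 1;
    return 0;
}
```

In the interchanged order, a[3][2] is computed before a[2][3] has been written, so it picks up the stale initial value and the results diverge.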

The element-to-element dependences in both the original and
interchanged loop are illustrated in Figure 9.

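Figure 9   LCDs in original and interchanged loops

[Figure: two panels, "Original loop" and "Interchanged loop", each showing
a grid of elements of a with rows 1 to 3 and columns j = 1 to 3]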

The arrows in Figure 9 represent dependences from one element to
another, pointing at elements that depend on the elements at the arrows'
bases. Shaded elements indicate a typical row or column computed in the 
inner loop: 



• Darkly shaded elements have already been computed.

• Lightly shaded elements have not yet been computed. 

This figure helps to illustrate the sequence in which the array elements 
are cycled through by the respective loops: the original loop cycles across 
all the columns in a row, then moves on to the next row. The
interchanged loop cycles down all the rows in a column first, then moves
on to the next column.

Avoid loop interchange 

Interchange is inhibited only by loops that contain dependences that
change when the loop is interchanged. Most LCDs do not fall into this 
category and thus do not inhibit loop interchange. 

Occasionally, the compiler encounters an apparent LCD. If it cannot 
determine whether the LCD actually inhibits interchange, it 
conservatively avoids interchanging the loop. 

The following Fortran example illustrates this situation: 

DO I = 1, N 
DO J = 2, M 

A(I,J) = A(I+IADD,J+JADD) + B(I,J) 

ENDDO 

ENDDO 

In this example, if iadd and jadd are either both positive or both
negative, the loop contains no interchange-inhibiting dependence.
However, if one and only one of the variables is negative, interchange is
inhibited. The compiler has no way of knowing the runtime values of
iadd and jadd, so it avoids interchanging the loop.

If you are positive that iadd and jadd are both negative or both
positive, you can tell the compiler that the loop is free of dependences
using the no_loop_dependence directive or pragma, described in
Table 4 on page 60.

The previous Fortran loop is interchanged when the
no_loop_dependence directive is specified for a on the j loop as shown
in the following code:

DO I = 1, N 

C$DIR NO_LOOP_DEPENDENCE(A) 

DO J = 2, M 

A(I,J) = A(I+IADD,J+JADD) + B(I,J) 

ENDDO 

ENDDO 
If iadd and jadd acquire opposite-signed values at runtime, these loops
may result in incorrect answers. 

Other loop fusion dependences 

In some cases, loop fusion is also inhibited by simpler dependences than
those that inhibit interchange. Consider the following Fortran example:

DO I = 1, N-1
  A(I) = B(I+1) + C(I)
ENDDO

DO J = 1, N-1
  D(J) = A(J+1) + E(J)
ENDDO

While it might appear that loop fusion would benefit the preceding 
example, it would actually yield the following incorrect code: 

DO ITEMP = 1, N-1
  A(ITEMP) = B(ITEMP+1) + C(ITEMP)
  D(ITEMP) = A(ITEMP+1) + E(ITEMP)
ENDDO

This loop produces different answers than the original loops, because the
reference to a(itemp+1) in the fused loop accesses a value that has not
been assigned yet, while the analogous reference to a(j+1) in the
original j loop accesses a value that was assigned in the original i loop.
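
A C rendering of the same pair of loops makes the difference concrete. This is a sketch with hypothetical names and sample data, using zero-based subscripts; it is not taken from the compiler.

```c
#include <assert.h>

#define LEN 6

/* The two loops as originally written. */
void separate_loops(int a[LEN], const int b[LEN], const int c[LEN],
                    int d[LEN], const int e[LEN])
{
    for (int i = 0; i < LEN - 1; i++)
        a[i] = b[i + 1] + c[i];
    for (int j = 0; j < LEN - 1; j++)
        d[j] = a[j + 1] + e[j];
}

/* The (incorrect) fused form. */
void fused_loop(int a[LEN], const int b[LEN], const int c[LEN],
                int d[LEN], const int e[LEN])
{
    for (int t = 0; t < LEN - 1; t++) {
        a[t] = b[t + 1] + c[t];
        d[t] = a[t + 1] + e[t];   /* a[t + 1] has not been assigned yet */
    }
}

/* Returns 1 if fusing the loops changes the contents of d. */
int fusion_changes_result(void)
{
    const int b[LEN] = {1, 2, 3, 4, 5, 6};
    const int c[LEN] = {1, 1, 1, 1, 1, 1};
    const int e[LEN] = {0};
    int a1[LEN] = {0}, d1[LEN] = {0};
    int a2[LEN] = {0}, d2[LEN] = {0};
    separate_loops(a1, b, c, d1, e);
    fused_loop(a2, b, c, d2, e);
    /* Hand-computed: d1[0] = a1[1] + e[0] = (b[2] + c[1]) + 0 = 4. */
    assert(d1[0] == 4);
    for (int j = 0; j < LEN - 1; j++)
        if (d1[j] != d2[j])
            return 1;
    return 0;
}
```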

Aliasing 

An alias is an alternate name for an object. Aliasing occurs in a program 
when two or more names are attached to the same memory location. 
Aliasing is typically caused in Fortran by use of the EQUIVALENCE
statement. The use of pointers normally causes the problem in C and
C++. Passing identical actual arguments into different dummy
arguments in a Fortran subprogram can also cause aliasing, as can
passing the same address into different pointer arguments in a C or C++
function.

Aliasing 

Aliasing interferes with data localization because it can mask LCDs 
where arrays a and b have been equivalenced. This is shown in the 
following Fortran example: 

INTEGER A(100,100), B(100,100), C(100,100)
EQUIVALENCE(A, B)

DO I = 1, N
  DO J = 2, M
    A(I,J) = B(I-1,J+1) + C(I,J)
  ENDDO
ENDDO

This loop has the same problem as the loop used to demonstrate LCDs in 
the previous section; because a and b refer to the same array, the loop 
contains an LCD on a, which prevents interchange and thus interferes 
with localization. 

The C and C++ equivalent of this loop follows. Keep in mind that C and
C++ store arrays in row-major order, which requires different
subscripting to access the same elements.

int a[100][100], c[100][100], i, j;
int (*b)[100];
b = a;

for (i = 1; i < n; i++) {
    for (j = 0; j < m; j++) {
        a[j][i] = b[j+1][i-1] + c[j][i];
    }
}

Fortran's EQUIVALENCE statement is imitated in C and C++; through the
use of pointers, arrays are effectively equivalenced, as shown.

Passing the same address into different dummy procedure arguments
can yield the same result. Fortran passes arguments by reference while
C and C++ pass them by value. However, pass-by-reference is simulated
in C and C++ by passing the argument's address into a pointer in the
receiving procedure or in C++ by using references.
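
The following one-dimensional sketch (hypothetical names, not from the HP examples) shows why the compiler must be cautious: the same function carries no dependence between iterations when its two pointer arguments are distinct, but carries one when both point at the same array.

```c
#include <assert.h>

/* dst[i] = src[i - 1] + 1 for i = 1 .. n-1. The loop carries a dependence
   only when dst and src refer to the same storage. */
void shift_add(int *dst, const int *src, int n)
{
    assert(n >= 0);
    for (int i = 1; i < n; i++)
        dst[i] = src[i - 1] + 1;
}

/* Last element after calling shift_add with two distinct zeroed arrays. */
int last_without_alias(void)
{
    int src[4] = {0, 0, 0, 0};
    int dst[4] = {0, 0, 0, 0};
    shift_add(dst, src, 4);
    return dst[3];
}

/* Last element after passing the same zeroed array as both arguments. */
int last_with_alias(void)
{
    int a[4] = {0, 0, 0, 0};
    shift_add(a, a, 4);
    return a[3];
}
```

Without aliasing every iteration reads an untouched zero, so the last element is 1. With aliasing each iteration reads the value written by the previous one, so the last element is 3; reordering the iterations would change that answer.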

Aliasing 

The following Fortran code exhibits the same aliasing problem as the 
previous example, but the alias is created by passing the same actual 
argument into different dummy arguments. 

The sample code below violates the Fortran standard. 


CALL ALI(A, A, C)


SUBROUTINE ALI(A,B,C)
INTEGER A(100,100), B(100,100), C(100,100)
DO J = 1, N
  DO I = 2, M
    A(I,J) = B(I-1,J+1) + C(I,J)
  ENDDO
ENDDO


The following (legal ANSI C) code shows the same argument-passing 
problem in C: 


ali(a, a, c);


void ali(int a[100][100], int b[100][100], int c[100][100])
{
    int i, j;

    for (j = 0; j < n; j++) {
        for (i = 1; i < m; i++) {
            a[j][i] = b[j+1][i-1] + c[j][i];
        }
    }
}


Computed or assigned goto statements in Fortran 

When the Fortran compiler encounters a computed or assigned GOTO
statement in an otherwise interchangeable loop, it cannot always
determine whether the branch destination is within the loop. Because an
out-of-loop destination would be a loop exit, these statements often
prevent loop interchange and therefore data localization.


I/O statements 

The order in which values are read into or written from a loop may 
change if the loop is interchanged. For this reason, I/O statements inhibit 
interchange and, consequently, data localization. 

I/O statements 

The following Fortran code is the basis for this example: 

DO I = 1, 4 
DO J = 1, 4 

READ *, IA(I,J) 

ENDDO 

ENDDO 

Given a data stream consisting of alternating zeros and ones
(0,1,0,1,0,1...), the contents of ia(i,j) for both the original loop and the
interchanged loop are shown in Figure 10.

Figure 10   Values read into array a

Original loop               Interchanged loop

         j                           j
      1  2  3  4                  1  2  3  4
i  1  0  1  0  1            i  1  0  0  0  0
   2  0  1  0  1               2  1  1  1  1
   3  0  1  0  1               3  0  0  0  0
   4  0  1  0  1               4  1  1  1  1
Multiple loop entries or exits

Loops that contain multiple entries or exits inhibit data localization 
because they cannot safely be interchanged. Extra loop entries are 
usually created when a loop contains a branch destination. Extra exits 
are more common, however. These are often created in C and C++ using
the break statement, and in Fortran using the GOTO statement.

As noted before, the order of computation changes if the loops are 
interchanged. 

Multiple loop entries or exits 

This example begins with the following C code:

for (j = 0; j < n; j++) {
    for (i = 0; i < m; i++) {
        a[i][j] = b[i][j] + c[i][j];
        if (a[i][j] == 0) break;
    }
}

Interchanging this loop would change the order in which the values of a
are computed. The original loop computes a column-by-column, whereas
the interchanged loop would compute it row-by-row. This means that the
interchanged loop may hit the break statement and exit after computing
a different set of elements than the original loop computes. Interchange
therefore may cause the results of the loop to differ and must be avoided.
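
A small sketch shows how the set of computed elements changes. The names, the 3 x 3 size, the sample data, and the UNSET sentinel are all assumptions made for illustration.

```c
#include <assert.h>

#define DIM 3
#define UNSET (-1)   /* marks elements the loop never computed */

/* Original order: j outer, i inner. break leaves the inner i loop. */
void j_outer(int a[DIM][DIM], int b[DIM][DIM], int c[DIM][DIM])
{
    for (int j = 0; j < DIM; j++)
        for (int i = 0; i < DIM; i++) {
            a[i][j] = b[i][j] + c[i][j];
            if (a[i][j] == 0) break;
        }
}

/* Interchanged order: i outer, j inner. break now leaves the j loop. */
void i_outer(int a[DIM][DIM], int b[DIM][DIM], int c[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            a[i][j] = b[i][j] + c[i][j];
            if (a[i][j] == 0) break;
        }
}

/* Returns 1 if the two orders compute different sets of elements. */
int break_changes_result(void)
{
    int b[DIM][DIM] = {{1, 1, 1}, {0, 1, 1}, {1, 1, 1}};
    int c[DIM][DIM] = {{0}};
    int a1[DIM][DIM], a2[DIM][DIM];
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            a1[i][j] = a2[i][j] = UNSET;
    j_outer(a1, b, c);
    i_outer(a2, b, c);
    /* Both orders reach the zero element at [1][0]. */
    assert(a1[1][0] == 0 && a2[1][0] == 0);
    /* j-outer skips a[2][0]; i-outer skips a[1][1] (and a[1][2]) instead. */
    return a1[2][0] == UNSET && a2[2][0] == 1
        && a1[1][1] == 1 && a2[1][1] == UNSET;
}
```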

return or stop statements in Fortran 

Like loops with multiple exits, return and stop statements in Fortran
inhibit localization because they inhibit interchange. If a loop containing
a return or stop is interchanged, its order of computation may change,
giving wrong answers.

return or exit statements in C or C++ 

Similar to Fortran's return and stop statements (discussed in the
previous section), return and exit statements in C and C++ inhibit
localization because they inhibit interchange.
throw statements in C++

In C++, throw statements, like loops containing multiple exits, inhibit
localization because they inhibit interchange.

Procedure calls 

HP compilers are unaware of the side effects of most procedures, and
therefore cannot determine whether or not they might interfere with
loop interchange. Consequently, the compilers do not perform loop
interchange on a loop containing an embedded procedure call. These side
effects may include data dependences involving loop arrays, aliasing (as
described in the "Aliasing" section on page 64), and use of the processor
data cache that conflicts with the loop's cache use. Such conflicts render
useless any data localization optimizations performed on the loop.

The compiler can parallelize a loop containing a procedure call if it can
verify that the procedure will not cause any side effects.
Loop blocking

Loop blocking is a combination of strip mining and interchange that 
maximizes data localization. It is provided primarily to deal with nested 
loops that manipulate arrays that are too large to fit into the cache. 
Under certain circumstances, loop blocking allows reuse of these arrays 
by transforming the loops that manipulate them so that they manipulate
strips of the arrays that fit into the cache. Effectively, a blocked loop 
accesses array elements in sections that are optimally sized to fit in the 
cache. 

The loop-blocking optimization is only available at +O3 (and above) in the
HP compilers; it is disabled by default. To enable loop blocking, use the
+Oloop_block option. Specifying +Onoloop_block (the default)
disables both automatic and directive-specified loop blocking. Specifying
+Onoloop_transform also disables loop blocking, as well as loop
distribution, loop interchange, loop fusion, loop unroll, and loop unroll
and jam.

Loop blocking can also be enabled for specific loops using the
block_loop directive and pragma. The block_loop and
no_block_loop directives and pragmas affect the immediately
following loop. You can also instruct the compiler to use a specific block
factor using block_loop. The no_block_loop directive and pragma
disables loop blocking for a particular loop.

The forms of these directives and pragmas are shown in Table 7.


Table 7   Forms of block_loop, no_block_loop directives and pragmas

Language    Form
--------    ----
Fortran     C$DIR BLOCK_LOOP[(BLOCK_FACTOR = n)]
            C$DIR NO_BLOCK_LOOP

C           #pragma _CNX block_loop[(block_factor = n)]
            #pragma _CNX no_block_loop
where

n           is the requested block factor, which must be a
            compile-time integer constant. The compiler uses this
            value as stated. For the best performance, the block
            factor multiplied by the data type size of the data in the
            loop should be an integral multiple of the cache line
            size.

In the absence of the block_factor argument, this directive is useful
for indicating which loop in a nest to block. In this case, the compiler
uses a heuristic to determine the block factor.

Data reuse 

Data reuse is important to understand when discussing blocking. There 
are two types of data reuse associated with loop blocking: 

• Spatial reuse 

• Temporal reuse 

Spatial reuse 

Spatial reuse uses data that was encached as a result of fetching another 
piece of data from memory; data is fetched by cache lines. 32 bytes of 
data is encached on every fetch on V2250 servers. Cache line sizes may 
be different on other HP SMPs.

On the initial fetch of array data from memory within a stride-one loop,
the requested item may be located anywhere in the 32-byte cache line,
unless the array is aligned on cache line boundaries. Refer to "Standard
optimization features," on page 35, for a description of data alignment.

Starting with the cache-aligned memory fetch, the requested data is
located at the beginning of the cache line, and the rest of the cache line
contains subsequent array elements. For a REAL*4 array, this means the
requested element and the seven following elements are encached on
each fetch after the first.

If any of these seven elements could then be used on any subsequent
iterations of the loop, the loop would be exploiting spatial reuse. For
loops with strides greater than one, spatial reuse can still occur.
However, the cache lines contain fewer usable elements.




Temporal reuse 

Temporal reuse uses the same data item on more than one iteration of 
the loop. An array element whose subscript does not change as a function 
of the iterations of a surrounding loop exhibits temporal reuse in the 
context of the loop. 

Loops that stride through arrays are candidates for blocking when there
is also an outermost loop carrying spatial or temporal reuse. Blocking the
innermost loop allows data referenced by the outer loop to remain in
the cache across multiple iterations. Blocking exploits spatial reuse by 
ensuring that once fetched, cache lines are not overwritten until their 
spatial reuse is exhausted. Temporal reuse is similarly exploited. 

Simple loop blocking 

In order to exploit reuse in more realistic examples that manipulate
arrays that do not all fit in the cache, the compiler can apply a blocking 
transformation. 

The following Fortran example demonstrates this: 

REAL*8 A(1000,1000),B(1000,1000) 

REAL*8 C (1000),D (1000) 

COMMON /BLK2/ A, B, C 


DO J = 1, 1000 
DO I = 1, 1000 

A (I, J) = B ( J, I) + C (I) + D ( J) 

ENDDO 

ENDDO 

Here the array elements occupy nearly 16 Mbytes of memory. Thus,
blocking becomes profitable. 

First the compiler strip mines the i loop: 

DO J = 1, 1000 

DO IOUT = 1, 1000, IBLOCK 

DO I = IOUT, IOUT+IBLOCK-1 

A (I, J) = B (J, I) + C(I) + D (J) 

ENDDO 

ENDDO 

ENDDO 

iblock is the block factor (also referred to as the strip mine length) the
compiler chooses based on the size of the arrays and size of the cache.
Note that this example assumes the chosen iblock divides 1000 evenly.
Next, the compiler moves the outer strip loop (iout) outward as far as
possible. 

DO IOUT = 1, 1000, IBLOCK 
DO J = 1, 1000 

DO I = IOUT, IOUT+IBLOCK-1 

A (I, J) = B (J, I) + C(I) + D (J) 

ENDDO 

ENDDO 

ENDDO 

This new nest accesses iblock rows of a and iblock columns of b for
every iteration of j. At every iteration of iout, the nest accesses 1000
iblock-length columns of a (an iblock x 1000 chunk of a) and 1000
iblock-width rows of b. This is illustrated in Figure 11.

Figure 11   Blocked array access

[Figure: blocked access to the arrays, with strip boundaries marked at 1,
IBLOCK, and IBLOCK+1]

Fetches of a encache the needed element and the three elements that are
used in the three subsequent iterations, giving spatial reuse on a.
Because the i loop traverses columns of b, fetches of b encache extra
elements that are not spatially reused until j increments. iblock is
chosen by the compiler to efficiently exploit spatial reuse of both a and b.

Figure 12 illustrates how cache lines of each array are fetched. a and b
both start on cache line boundaries because they are in common. The
shaded area represents the initial cache line fetched.

Figure 12   Spatial reuse of a and b
• When A(1,1) is accessed, A(1:4,1) is fetched; A(2:4,1) is used on
subsequent iterations 2, 3 and 4 of I.

• B(1:4,1) is fetched when I=1, but B(2:4,1) is not used until J
increments to 2, 3, 4. B(1:4,2) is fetched when I=2.

Typically, iblock elements of c remain in the cache for several
iterations of j before being overwritten, giving temporal reuse on c for
those iterations. By the time any of the arrays are overwritten, all spatial
reuse has been exhausted. The load of d is removed from the i loop so
that it remains in a register for all iterations of i.
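
The blocked nest can be sketched in C as follows. This is an illustration only: the names and the small array size are assumptions, the subscripts are swapped relative to the Fortran code to reflect row-major storage, and the strip bound is clamped so that the block factor need not divide the array size evenly.

```c
#include <assert.h>

#define NB 10

/* Unblocked nest: a[j][i] = b[i][j] + c[i] + d[j]. */
void nest_plain(double a[NB][NB], double b[NB][NB], double c[NB], double d[NB])
{
    for (int j = 0; j < NB; j++)
        for (int i = 0; i < NB; i++)
            a[j][i] = b[i][j] + c[i] + d[j];
}

/* Blocked nest: the i loop is strip mined and the strip loop (iout) is
   moved outside the j loop. */
void nest_blocked(double a[NB][NB], double b[NB][NB], double c[NB],
                  double d[NB], int iblock)
{
    assert(iblock > 0);
    for (int iout = 0; iout < NB; iout += iblock) {
        int iend = (iout + iblock < NB) ? iout + iblock : NB;
        for (int j = 0; j < NB; j++)
            for (int i = iout; i < iend; i++)
                a[j][i] = b[i][j] + c[i] + d[j];
    }
}

/* Returns 1 if the blocked nest produces the same array as the plain one. */
int blocked_matches_plain(int iblock)
{
    double a1[NB][NB], a2[NB][NB], b[NB][NB], c[NB], d[NB];
    for (int i = 0; i < NB; i++) {
        c[i] = i;
        d[i] = 2 * i;
        for (int j = 0; j < NB; j++)
            b[i][j] = i * NB + j;
    }
    nest_plain(a1, b, c, d);
    nest_blocked(a2, b, c, d, iblock);
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            if (a1[i][j] != a2[i][j])
                return 0;
    return 1;
}
```

Blocking changes only the order in which elements are visited, so any positive block factor gives the same array contents; the benefit is in cache behavior, which this sketch does not measure.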

Matrix multiply blocking 

The more complicated matrix multiply algorithm, which follows, is a 
prime candidate for blocking: 

REAL*8 A(1000,1000),B(1000,1000),C(1000,1000) 

COMMON /BLK3/ A, B, C 


DO I = 1, 1000 
DO J = 1, 1000 
DO K = 1, 1000 

C(I,J) = C(I,J) + A (I, K) * B(K,J) 
ENDDO 
ENDDO 
ENDDO 
This loop is blocked as shown below:

DO IOUT = 1, 1000, IBLOCK 
DO KOUT = 1, 1000, KBLOCK 
DO J = 1, 1000 

DO I = IOUT, IOUT+IBLOCK-1 
DO K = KOUT, KOUT+KBLOCK-1 

C(I,J) = C(I,J) + A (I, K) * B (K, J) 

ENDDO 

ENDDO 

ENDDO 

ENDDO 

ENDDO 

As a result, the following occurs: 

• Spatial reuse of b with respect to the k loop

• Temporal reuse of b with respect to the i loop

• Spatial reuse of a with respect to the i loop

• Temporal reuse of a with respect to the j loop

• Spatial reuse of c with respect to the i loop

• Temporal reuse of c with respect to the k loop

An analogous C and C++ example follows with a different resulting
interchange:

static double a[1000][1000], b[1000][1000];
static double c[1000][1000];


for (i=0; i<1000; i++)
    for (j=0; j<1000; j++)
        for (k=0; k<1000; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];

The HP C and aC++ compilers interchange and block the loop in this
example to provide optimal access efficiency for the row-major C and C++
arrays. The blocked loop is shown below:

for (jout=0; jout<1000; jout+=jblk)
    for (kout=0; kout<1000; kout+=kblk)
        for (i=0; i<1000; i++)
            for (j=jout; j<jout+jblk; j++)
                for (k=kout; k<kout+kblk; k++)
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];


As you can see, the interchange was done differently because of the
different array storage strategy used by C and C++. This code yields:

• Spatial reuse of b with respect to the j loop

• Temporal reuse of b with respect to the i loop

• Spatial reuse of a with respect to the k loop

• Temporal reuse of a with respect to the j loop

• Spatial reuse on c with respect to the j loop

• Temporal reuse on c with respect to the k loop

Blocking is inhibited when loop interchange is inhibited. If a candidate 
loop nest contains loops that cannot be interchanged, blocking is not 
performed. 
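
A scaled-down, self-contained version of the blocked C matrix multiply is sketched below. The names, the 7 x 7 size, and the integer-valued test data are assumptions chosen so that results compare exactly; the block bounds are clamped because the block factors here do not divide the matrix size.

```c
#include <assert.h>

#define SZ 7

/* Straightforward i, j, k nest. */
void matmul_plain(double c[SZ][SZ], double a[SZ][SZ], double b[SZ][SZ])
{
    for (int i = 0; i < SZ; i++)
        for (int j = 0; j < SZ; j++)
            for (int k = 0; k < SZ; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Blocked nest: the j and k loops are strip mined and their strip loops
   (jout, kout) are moved outside the i loop. */
void matmul_blocked(double c[SZ][SZ], double a[SZ][SZ], double b[SZ][SZ],
                    int jblk, int kblk)
{
    assert(jblk > 0 && kblk > 0);
    for (int jout = 0; jout < SZ; jout += jblk) {
        int jend = (jout + jblk < SZ) ? jout + jblk : SZ;
        for (int kout = 0; kout < SZ; kout += kblk) {
            int kend = (kout + kblk < SZ) ? kout + kblk : SZ;
            for (int i = 0; i < SZ; i++)
                for (int j = jout; j < jend; j++)
                    for (int k = kout; k < kend; k++)
                        c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}

/* Returns 1 if the blocked product equals the plain product. */
int blocked_matmul_matches(int jblk, int kblk)
{
    double a[SZ][SZ], b[SZ][SZ];
    double c1[SZ][SZ] = {{0}}, c2[SZ][SZ] = {{0}};
    for (int i = 0; i < SZ; i++)
        for (int j = 0; j < SZ; j++) {
            a[i][j] = i + j;      /* small integers keep sums exact */
            b[i][j] = i - j;
        }
    matmul_plain(c1, a, b);
    matmul_blocked(c2, a, b, jblk, kblk);
    for (int i = 0; i < SZ; i++)
        for (int j = 0; j < SZ; j++)
            if (c1[i][j] != c2[i][j])
                return 0;
    return 1;
}
```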

Loop blocking 

The following example shows the effect of the block_loop directive on
the code shown earlier in the "Matrix multiply blocking" section on page 74:

REAL*8 A(1000, 1000),B(1000,1000) 

REAL*8 C (1000, 1000) 

COMMON /BLK3/ A, B, C 


DO I = 1,1000 
DO J = 1, 1000 

C$DIR BLOCK_LOOP(BLOCK_FACTOR = 112) 

DO K = 1,1000 

C(I,J) = C(I,J) + A (I, K) *B(K,J) 

ENDDO 

ENDDO 

ENDDO 

The original example involving this code showed that the compiler blocks
the i and k loops. In this example, the block_loop directive instructs
the compiler to use a block factor of 112 for the k loop. This is an efficient
blocking factor for this example because 112 x 8 bytes = 896 bytes,
and 896/32 bytes (the cache line size) = 28, which is an integer, so partial
cache lines are not necessary. The compiler-chosen value is still used on
the i loop.
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Table 8 


Example 
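
The block factor arithmetic above can be reproduced with a short C
sketch. The function names are hypothetical, and the sketch assumes
that each block starts on a cache-line boundary.

```c
/* Returns 1 if block_factor elements of elem_size bytes fill a whole
   number of line_size-byte cache lines, 0 otherwise. */
int fills_whole_cache_lines(int block_factor, int elem_size, int line_size)
{
    return (block_factor * elem_size) % line_size == 0;
}

/* Number of cache lines touched by one block, rounded up. */
int cache_lines_per_block(int block_factor, int elem_size, int line_size)
{
    return (block_factor * elem_size + line_size - 1) / line_size;
}
```

For the values used in this example, a block factor of 112 with 8-byte
elements and 32-byte cache lines, the block fills exactly 28 lines.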


Loop distribution 

Loop distribution is another fundamental +O3 transformation necessary
for more advanced transformations. These advanced transformations
require that all calculations in a nested loop be performed inside the
innermost loop. To facilitate this, loop distribution transforms
complicated nested loops into several simple loops that contain all
computations inside the body of the innermost loop.

Loop distribution takes place at +O3 and above and is enabled by default.
Specifying +Onoloop_transform disables loop distribution, as well as
loop interchange, loop blocking, loop fusion, loop unroll, and loop unroll
and jam.

Loop distribution is disabled for specific loops by specifying the
no_distribute directive or pragma immediately before the loop.

The form of this directive and pragma is shown in Table 8.


Table 8 Form of no_distribute directive and pragma

Language   Form
Fortran    C$DIR NO_DISTRIBUTE
C          #pragma _CNX no_distribute


Example Loop distribution

This example begins with the following Fortran code:

      DO I = 1, N
        C(I) = 0
        DO J = 1, M
          A(I,J) = A(I,J) + B(I,J) * C(I)
        ENDDO
      ENDDO

Loop distribution creates two copies of the i loop, separating the nested
j loop from the assignments to array c. In this way, all assignments are
moved to innermost loops. Interchange is then performed on the i and j
loops.

The distribution and interchange are shown in the following transformed
code:

      DO I = 1, N
        C(I) = 0
      ENDDO

      DO J = 1, M
        DO I = 1, N
          A(I,J) = A(I,J) + B(I,J) * C(I)
        ENDDO
      ENDDO

Distribution can improve efficiency by reducing the number of memory 
references per loop iteration and the amount of cache thrashing. It also 
creates more opportunities for interchange. 
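
As a rough C counterpart of the distribution step, the sketch below
runs the original and the distributed nests on the same data and
compares the results. The sizes, the function names, and the parameter
v are assumptions for illustration: the Fortran example assigns 0 to
c, and v is added only so that a nonzero value makes the effect
visible. No interchange is shown because, for row-major C arrays, an
outer i loop with an inner j loop already walks memory contiguously.

```c
#define N 4   /* assumed row count */
#define M 3   /* assumed column count */

/* Original form: the assignment to c[i] sits outside the innermost loop. */
void nest_original(double a[N][M], double b[N][M], double c[N], double v)
{
    int i, j;
    for (i = 0; i < N; i++) {
        c[i] = v;
        for (j = 0; j < M; j++)
            a[i][j] = a[i][j] + b[i][j] * c[i];
    }
}

/* Distributed form: every statement is in the body of an innermost loop. */
void nest_distributed(double a[N][M], double b[N][M], double c[N], double v)
{
    int i, j;
    for (i = 0; i < N; i++)
        c[i] = v;
    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++)
            a[i][j] = a[i][j] + b[i][j] * c[i];
}

/* Runs both forms on identical data; returns 1 if the results agree. */
int distribution_preserves_result(double v)
{
    double a1[N][M], a2[N][M], b[N][M], c1[N], c2[N];
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++) {
            a1[i][j] = a2[i][j] = (double)(i + j);
            b[i][j] = (double)(i * M + j);
        }
    nest_original(a1, b, c1, v);
    nest_distributed(a2, b, c2, v);
    for (i = 0; i < N; i++) {
        if (c1[i] != c2[i])
            return 0;
        for (j = 0; j < M; j++)
            if (a1[i][j] != a2[i][j])
                return 0;
    }
    return 1;
}
```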


Loop fusion 

Loop fusion involves creating one loop out of two or more neighboring
loops that have identical loop bounds and trip counts. This reduces loop
overhead and memory accesses, and increases register usage. It can also lead
to other optimizations. By potentially reducing the number of
parallelizable loops in a program and increasing the amount of work in
each of those loops, loop fusion can greatly reduce parallelization
overhead, because fewer spawns and joins are necessary.

Loop fusion takes place at +O3 and above and is enabled by default.
Specifying +Onoloop_transform disables loop fusion, as well as
loop distribution, loop interchange, loop blocking, loop unroll, and
loop unroll and jam.

Occasionally, loops that do not appear to be fusible become fusible as a
result of compiler transformations that precede fusion. For instance,
interchanging a loop may make it suitable for fusing with another loop.

Loop fusion is especially beneficial when applied to Fortran array
assignments. The compiler translates these statements into loops; when
such loops do not contain code that inhibits fusion, they are fused.

Example Loop fusion

This example begins with the following Fortran code:

      DO I = 1, N
        A(I) = B(I) + C(I)
      ENDDO

      DO J = 1, N
        IF(A(J) .LT. 0) A(J) = B(J)*B(J)
      ENDDO

The two loops shown above are fused into the following loop using loop
fusion:

      DO I = 1, N
        A(I) = B(I) + C(I)
        IF(A(I) .LT. 0) A(I) = B(I)*B(I)
      ENDDO

Example Loop fusion

This example begins with the following Fortran code:

      REAL A(100,100), B(100,100), C(100,100)

      C = 2.0 * B
      A = A + B

The compiler first transforms these Fortran array assignments into
loops, generating code similar to that shown below.

      DO TEMP1 = 1, 100
        DO TEMP2 = 1, 100
          C(TEMP2,TEMP1) = 2.0 * B(TEMP2,TEMP1)
        ENDDO
      ENDDO

      DO TEMP3 = 1, 100
        DO TEMP4 = 1, 100
          A(TEMP4,TEMP3) = A(TEMP4,TEMP3) + B(TEMP4,TEMP3)
        ENDDO
      ENDDO

These two loops would then be fused as shown in the following loop nest:

      DO TEMP1 = 1, 100
        DO TEMP2 = 1, 100
          C(TEMP2,TEMP1) = 2.0 * B(TEMP2,TEMP1)
          A(TEMP2,TEMP1) = A(TEMP2,TEMP1) + B(TEMP2,TEMP1)
        ENDDO
      ENDDO

Further optimizations could be applied to this new nest as appropriate. 
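
The first fusion example in this section (the A(I) = B(I) + C(I) loop
followed by the conditional update) can be written in C as the sketch
below, which runs the separate and the fused loops on the same data
and compares the results. The array length, the test data, and the
function names are assumptions for illustration.

```c
#define N 8   /* assumed array length */

/* Two neighboring loops with identical bounds. */
void loops_separate(double a[N], const double b[N], const double c[N])
{
    int i, j;
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];
    for (j = 0; j < N; j++)
        if (a[j] < 0)
            a[j] = b[j] * b[j];
}

/* The fused loop: one pass, and a[i] is reused right after it is set. */
void loops_fused(double a[N], const double b[N], const double c[N])
{
    int i;
    for (i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
        if (a[i] < 0)
            a[i] = b[i] * b[i];
    }
}

/* Returns 1 if both versions give the same a for the same b and c. */
int fusion_preserves_result(void)
{
    double a1[N], a2[N], b[N], c[N];
    int i;
    for (i = 0; i < N; i++) {
        b[i] = (double)(i - 4);   /* gives both negative and positive sums */
        c[i] = -1.0;
    }
    loops_separate(a1, b, c);
    loops_fused(a2, b, c);
    for (i = 0; i < N; i++)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```

Fusion is safe here because iteration i of the second loop reads only
the a[i] written by iteration i of the first loop.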

Loop peeling 

When trip counts of adjacent loops differ by only a single iteration (+1
or -1), the compiler may peel an iteration from one of the two loops so
that the loops may then be fused. The peeled iteration is performed
separately from the original loop.
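
As a rough C sketch of the idea, the two functions below contrast a
pair of loops whose trip counts differ by one with the peeled and
fused form. The function names and the test data are assumptions for
illustration, indices are 0-based, and n is assumed to be at least 1.

```c
/* Original form: trip counts n-1 and n. a must hold n elements. */
void loops_unpeeled(int a[], int n)
{
    int i, j;
    for (i = 0; i < n - 1; i++)
        a[i] = i + 1;
    for (j = 0; j < n; j++)
        a[j] = a[j] + 1;
}

/* Peeled and fused form: the last iteration of the second loop is
   performed separately, so both loops share one n-1 trip loop. */
void loops_peeled_fused(int a[], int n)
{
    int i;
    for (i = 0; i < n - 1; i++) {
        a[i] = i + 1;
        a[i] = a[i] + 1;
    }
    a[n - 1] = a[n - 1] + 1;   /* the peeled iteration */
}

/* Returns 1 if both forms leave the same values in the array. */
int peeling_preserves_result(void)
{
    int a1[5] = {0, 0, 0, 0, 10};
    int a2[5] = {0, 0, 0, 0, 10};
    int i;
    loops_unpeeled(a1, 5);
    loops_peeled_fused(a2, 5);
    for (i = 0; i < 5; i++)
        if (a1[i] != a2[i])
            return 0;
    return a1[0] == 2 && a1[4] == 11;
}
```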

The following Fortran example shows how this is implemented:

      DO I = 1, N-1
        A(I) = I
      ENDDO

      DO J = 1, N
        A(J) = A(J) + 1
      ENDDO


As you can see, the Nth iteration of the j loop is peeled, resulting in a trip
count of N-1. The Nth iteration is performed outside the j loop. As a
result, the code is changed to the following:

      DO I = 1, N-1
        A(I) = I
      ENDDO

      DO J = 1, N-1
        A(J) = A(J) + 1
      ENDDO

      A(N) = A(N) + 1

The i and j loops now have the same trip count and are fused, as shown
below:

      DO I = 1, N-1
        A(I) = I
        A(I) = A(I) + 1
      ENDDO

      A(N) = A(N) + 1

Loop interchange

The compiler may interchange (or reorder) nested loops for the following 
reasons: 

• To facilitate other transformations 

• To relocate the loop that is the most profitable to parallelize so that it
is outermost

• To optimize inner-loop memory accesses

Loop interchange takes place at +O3 and above and is enabled by default.
Specifying +Onoloop_transform disables loop interchange, as well as
loop distribution, loop blocking, loop fusion, loop unroll, and loop unroll
and jam.

Example Loop interchange

This example begins with the Fortran matrix addition algorithm below:

      DO I = 1, N
        DO J = 1, M
          A(I,J) = B(I,J) + C(I,J)
        ENDDO
      ENDDO

The loop accesses the arrays a, b, and c row by row, which is very
inefficient in Fortran because Fortran stores arrays in column-major
order. Interchanging the i and j loops, as shown in the
following example, facilitates column-by-column access.

      DO J = 1, M
        DO I = 1, N
          A(I,J) = B(I,J) + C(I,J)
        ENDDO
      ENDDO

Unlike Fortran, C and C++ access arrays in row-major order. An
analogous example in C and C++, then, employs an opposite nest
ordering, as shown below.

for (j=0;j<m;j++)
  for (i=0;i<n;i++)
    a[i][j] = b[i][j] + c[i][j];


Interchange facilitates row-by-row access. The interchanged loop is
shown below.

for (i=0;i<n;i++)
  for (j=0;j<m;j++)
    a[i][j] = b[i][j] + c[i][j];
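
The reason the j loop belongs innermost in C can be seen from the
distance in memory between consecutive accesses. The sketch below
measures that distance for a small array; the dimensions and function
names are assumptions for illustration.

```c
#include <stddef.h>

#define ROWS 3   /* assumed dimensions */
#define COLS 5

static double a[ROWS][COLS];

/* Byte distance between a[i][j] and a[i][j+1]: neighbors in memory. */
ptrdiff_t stride_inner_j(void)
{
    return (char *)&a[0][1] - (char *)&a[0][0];
}

/* Byte distance between a[i][j] and a[i+1][j]: one whole row apart. */
ptrdiff_t stride_inner_i(void)
{
    return (char *)&a[1][0] - (char *)&a[0][0];
}
```

With j innermost, consecutive iterations are sizeof(double) bytes
apart. With i innermost, they are COLS * sizeof(double) bytes apart,
so each access is more likely to touch a different cache line.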
Loop unroll and jam

The loop unroll and jam transformation is primarily intended to increase
register exploitation and decrease memory loads and stores per
operation within an iteration of a nested loop. Improved register usage
decreases the need for main memory accesses and allows better
exploitation of certain machine instructions.

Unroll and jam involves partially unrolling one or more loops higher in
the nest than the innermost loop, and fusing ("jamming") the resulting
loops back together. For unroll and jam to be effective, a loop must be
nested and must contain data references that are temporally reused with
respect to some loop other than the innermost (temporal reuse is
described in the "Data reuse" section on page 71). The unroll and jam
optimization is automatically applied only to those loops that consist
strictly of a basic block.

Loop unroll and jam takes place at +O3 and above and is not enabled by
default in the HP compilers. To enable loop unroll and jam on the
command line, use the +Oloop_unroll_jam option. This allows both
automatic and directive-specified unroll and jam. Specifying
+Onoloop_transform disables loop unroll and jam, loop distribution,
loop interchange, loop blocking, loop fusion, and loop unroll.

The unroll_and_jam directive and pragma also enable this
transformation. The no_unroll_and_jam directive and pragma are used
to disable loop unroll and jam for an individual loop.


The forms of these directives and pragmas are shown in Table 9.

Table 9 Forms of unroll_and_jam, no_unroll_and_jam directives and pragmas

Language   Form
Fortran    C$DIR UNROLL_AND_JAM[(UNROLL_FACTOR=n)]
           C$DIR NO_UNROLL_AND_JAM
C          #pragma _CNX unroll_and_jam[(unroll_factor=n)]
           #pragma _CNX no_unroll_and_jam

where

unroll_factor=n   allows you to specify an unroll factor
                  for the loop in question.

NOTE Because unroll and jam is only performed on nested loops, you must ensure
that the directive or pragma is specified on a loop that, after any compiler-
initiated interchanges, is not the innermost loop. You can determine which
loops in a nest are innermost by compiling the nest without any directives
and examining the Optimization Report, described in "Optimization Report,"
on page 151.

Example Unroll and jam

Consider the following matrix multiply loop:

      DO I = 1, N
        DO J = 1, N
          DO K = 1, N
            A(I,J) = A(I,J) + B(I,K) * C(K,J)
          ENDDO
        ENDDO
      ENDDO

Here, the compiler can exploit a maximum of 3 registers: one for a(i,j),
one for b(i,k), and one for c(k,j).

Register exploitation is vastly increased on this loop by unrolling and
jamming the i and j loops. First, the compiler unrolls the i loop. To
simplify the illustration, an unrolling factor of 2 for i is used. This is the
number of times the contents of the loop are replicated.


The following Fortran example shows this replication:

      DO I = 1, N, 2
        DO J = 1, N
          DO K = 1, N
            A(I,J) = A(I,J) + B(I,K) * C(K,J)
          ENDDO
        ENDDO
        DO J = 1, N
          DO K = 1, N
            A(I+1,J) = A(I+1,J) + B(I+1,K) * C(K,J)
          ENDDO
        ENDDO
      ENDDO

The "jam" part of unroll and jam occurs when the loops are fused back
together, to create the following:

      DO I = 1, N, 2
        DO J = 1, N
          DO K = 1, N
            A(I,J) = A(I,J) + B(I,K) * C(K,J)
            A(I+1,J) = A(I+1,J) + B(I+1,K) * C(K,J)
          ENDDO
        ENDDO
      ENDDO

This new loop can exploit registers for two additional references: a(i,j)
and a(i+1,j). However, the compiler still has the j loop to unroll and
jam. An unroll factor of 4 for the j loop is used, in which case unrolling
gives the following:

      DO I = 1, N, 2
        DO J = 1, N, 4
          DO K = 1, N
            A(I,J) = A(I,J) + B(I,K) * C(K,J)
            A(I+1,J) = A(I+1,J) + B(I+1,K) * C(K,J)
          ENDDO
          DO K = 1, N
            A(I,J+1) = A(I,J+1) + B(I,K) * C(K,J+1)
            A(I+1,J+1) = A(I+1,J+1) + B(I+1,K) * C(K,J+1)
          ENDDO
          DO K = 1, N
            A(I,J+2) = A(I,J+2) + B(I,K) * C(K,J+2)
            A(I+1,J+2) = A(I+1,J+2) + B(I+1,K) * C(K,J+2)
          ENDDO
          DO K = 1, N
            A(I,J+3) = A(I,J+3) + B(I,K) * C(K,J+3)
            A(I+1,J+3) = A(I+1,J+3) + B(I+1,K) * C(K,J+3)
          ENDDO
        ENDDO
      ENDDO


Fusing (jamming) the unrolled loop results in the following: 


      DO I = 1, N, 2
        DO J = 1, N, 4
          DO K = 1, N
            A(I,J) = A(I,J) + B(I,K) * C(K,J)
            A(I+1,J) = A(I+1,J) + B(I+1,K) * C(K,J)
            A(I,J+1) = A(I,J+1) + B(I,K) * C(K,J+1)
            A(I+1,J+1) = A(I+1,J+1) + B(I+1,K) * C(K,J+1)
            A(I,J+2) = A(I,J+2) + B(I,K) * C(K,J+2)
            A(I+1,J+2) = A(I+1,J+2) + B(I+1,K) * C(K,J+2)
            A(I,J+3) = A(I,J+3) + B(I,K) * C(K,J+3)
            A(I+1,J+3) = A(I+1,J+3) + B(I+1,K) * C(K,J+3)
          ENDDO
        ENDDO
      ENDDO


This new loop exploits more registers and requires fewer loads and stores
than the original. Recall that the original loop could use no more than 3
registers. This unrolled-and-jammed loop can use 14, one for each of the
following references:

A(I,J)        B(I,K)        C(K,J)        A(I+1,J)
B(I+1,K)      A(I,J+1)      C(K,J+1)      A(I+1,J+1)
A(I,J+2)      C(K,J+2)      A(I,J+3)      A(I+1,J+2)
A(I+1,J+3)    C(K,J+3)

Fewer loads and stores per operation are required because all of the
registers containing these elements are referenced at least twice. This
particular example can also benefit from the PA-RISC fmpyfadd
instruction, which is available with PA-8x00 processors. This instruction
doubles the speed of the operations in the body of the loop by
simultaneously performing related adds and multiplies.

This is a very simplified example. In reality, the compiler attempts to
exploit as many of the PA-RISC processor's registers as possible. For the
matrix multiply algorithm used here, the compiler would select a larger
unrolling factor, creating a much larger k loop body. This would result in
increased register exploitation and fewer loads and stores per operation.
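
The unrolled-and-jammed nest can be sketched in C as shown below, with
the reused values held in local variables to make the register reuse
explicit. This is an illustration under stated assumptions, not the
code the compiler generates: the size N is assumed to be a multiple of
both unroll factors (2 for i, 4 for j) so that no cleanup iterations
are needed, and all function names are hypothetical.

```c
#define N 8   /* assumed size; a multiple of 2 and 4 */

/* The original nest, written for row-major C arrays. */
void matmul_plain(double a[N][N], double b[N][N], double c[N][N])
{
    int i, j, k;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                a[i][j] = a[i][j] + b[i][k] * c[k][j];
}

/* i unrolled by 2 and j by 4, then jammed into one k loop body.
   Each b and c value is loaded once and used for several updates. */
void matmul_unroll_jam(double a[N][N], double b[N][N], double c[N][N])
{
    int i, j, k;
    for (i = 0; i < N; i += 2)
        for (j = 0; j < N; j += 4)
            for (k = 0; k < N; k++) {
                double b0 = b[i][k], b1 = b[i + 1][k];
                double c0 = c[k][j], c1 = c[k][j + 1];
                double c2 = c[k][j + 2], c3 = c[k][j + 3];
                a[i][j]         = a[i][j]         + b0 * c0;
                a[i + 1][j]     = a[i + 1][j]     + b1 * c0;
                a[i][j + 1]     = a[i][j + 1]     + b0 * c1;
                a[i + 1][j + 1] = a[i + 1][j + 1] + b1 * c1;
                a[i][j + 2]     = a[i][j + 2]     + b0 * c2;
                a[i + 1][j + 2] = a[i + 1][j + 2] + b1 * c2;
                a[i][j + 3]     = a[i][j + 3]     + b0 * c3;
                a[i + 1][j + 3] = a[i + 1][j + 3] + b1 * c3;
            }
}

/* Runs both nests on identical integer-valued data;
   returns 1 if the results agree. */
int unroll_jam_matches_plain(void)
{
    double a1[N][N], a2[N][N], b[N][N], c[N][N];
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a1[i][j] = a2[i][j] = (double)(i - j);
            b[i][j] = (double)((i + 2 * j) % 5);
            c[i][j] = (double)((3 * i + j) % 7);
        }
    matmul_plain(a1, b, c);
    matmul_unroll_jam(a2, b, c);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            if (a1[i][j] != a2[i][j])
                return 0;
    return 1;
}
```

The jammed body works with the same 14 values counted above: eight
elements of a, two of b, and four of c.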

Excessive unrolling may introduce extra register spills if the unrolled and
jammed loop body becomes too large. Each cache line has a 32-bit register
value; register spills occur when this value is exceeded. This most often
occurs as a result of continuous loop unrolling. Register spills may have
negative effects on performance.

You should attempt to select unroll factor values that align data
references in the innermost loop on cache boundaries. As a result,
references to the consecutive memory regions in the innermost loop can
have very high cache hit ratios. Unroll factors of 5 or 7 may not be good
choices because most array element sizes are either 4 bytes or 8 bytes
and the cache line size is 32 bytes. Therefore, an unroll factor of 2 or 4 is
more likely to effectively exploit cache line reuse for the references that
access consecutive memory regions.
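
One simple way to read this guideline is sketched below in C. The test
used here, that a run of consecutive elements either divides a cache
line evenly or covers a whole number of cache lines, is an assumption
made for illustration and is not a rule stated by the compiler
documentation. The function name is hypothetical, and all arguments
are assumed to be positive.

```c
/* Returns 1 if a run of unroll_factor consecutive elements of
   elem_size bytes either divides a line_size-byte cache line evenly
   or covers a whole number of cache lines, 0 otherwise. */
int factor_aligns_with_cache_line(int unroll_factor, int elem_size,
                                  int line_size)
{
    int run_bytes = unroll_factor * elem_size;
    return line_size % run_bytes == 0 || run_bytes % line_size == 0;
}
```

With a 32-byte cache line, factors of 2 and 4 pass this test for both
4-byte and 8-byte elements, while factors of 5 and 7 fail it for both
element sizes.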

As with all optimizations that replicate code, the number of new loops
created when the compiler performs the unroll and jam optimization is
limited by default to ensure reasonable compile times. To increase the
replication limit and possibly increase your compile time and code size,
specify the +Onosize and +Onolimit compiler options.


Preventing loop reordering 

The no_loop_transform directive or pragma allows you to prevent all
loop-reordering transformations on the immediately following loop.

The form of this directive and pragma is shown in Table 10.

Table 10 Form of no_loop_transform directive and pragma

Language   Form
Fortran    C$DIR NO_LOOP_TRANSFORM
C          #pragma _CNX no_loop_transform

Use the command-line option +Onoloop_transform (at +O3 and above)
to disable loop distribution, loop blocking, loop fusion, loop interchange,
loop unroll, and loop unroll and jam at the file level.


Test promotion 

Test promotion involves promoting a test out of the loop that encloses it
by replicating the containing loop for each branch of the test. The
replicated loops contain fewer tests than the originals, or no tests at all,
so the loops execute much faster. Multiple tests are promoted, and copies
of the loop are made for each test.

Example Test promotion

Consider the following Fortran loop:

      DO I=1, 100
        DO J=1, 100
          IF(FOO .EQ. BAR) THEN
            A(I,J) = I + J
          ELSE
            A(I,J) = 0
          END IF
        ENDDO
      ENDDO

Test promotion (and loop interchange) produces the following code:

      IF(FOO .EQ. BAR) THEN
        DO J=1, 100
          DO I=1, 100
            A(I,J) = I + J
          ENDDO
        ENDDO
      ELSE
        DO J=1, 100
          DO I=1, 100
            A(I,J) = 0
          ENDDO
        ENDDO
      END IF

For loops containing large numbers of tests, loop replication can greatly
increase the size of the code.
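
The same promotion can be sketched in C. The array size and function
names below are assumptions for illustration, and no interchange is
applied because the i-outer, j-inner order already suits row-major C
arrays.

```c
#define N 4   /* assumed array dimension */

/* The loop-invariant test is evaluated on every iteration. */
void fill_with_test_in_loop(int a[N][N], int foo, int bar)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            if (foo == bar)
                a[i][j] = i + j;
            else
                a[i][j] = 0;
        }
}

/* The test is promoted out of the nest; each branch gets its own
   copy of the loops, and neither copy contains a test. */
void fill_with_test_promoted(int a[N][N], int foo, int bar)
{
    int i, j;
    if (foo == bar) {
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                a[i][j] = i + j;
    } else {
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                a[i][j] = 0;
    }
}

/* Returns 1 if both versions fill the array identically. */
int promotion_preserves_result(int foo, int bar)
{
    int a1[N][N], a2[N][N];
    int i, j;
    fill_with_test_in_loop(a1, foo, bar);
    fill_with_test_promoted(a2, foo, bar);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            if (a1[i][j] != a2[i][j])
                return 0;
    return 1;
}
```

The promoted version contains two copies of the nest, which is the
code growth described above.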

Each do loop in Fortran and for loop in C and C++ whose bounds are not
known at compile-time is implicitly tested to check that the loop iterates
at least once. This test may be promoted, with the promotion noted in the
Optimization Report. If you see unexpected promotions in the report,
this implicit testing may be the cause. For more information on the
Optimization Report, see "Optimization Report," on page 151.

Cross-module cloning

Cloning is the replacement of a call to a routine by a call to a clone of that 
routine. The clone is optimized differently than the original routine. 
Cloning can expose additional opportunities for optimization across 
multiple source files. 

Cloning at +O4 is performed across all procedures within the program,
and is disabled by specifying the +Onoinline command-line option. This
option is described on page 124.

Global and static variable optimizations 

Global and static variable optimizations look for ways to reduce the 
number of instructions required for accessing global and static variables 
(common and save variables in Fortran, and extern and static 
variables in C and C++). 

The compiler normally generates two machine instructions when 
referencing global variables. Depending on the locality of the global 
variables, single machine instructions may sometimes be used to access 
these variables. The linker rearranges the storage location of global and
static data to increase the number of variables that are referenced by
single instructions.

Global variable optimization coding standards 

Because this optimization rearranges the location and data alignment of 
global variables, follow the programming practices given below: 

• Do not make assumptions about the relative storage location of 
variables, such as generating a pointer by adding an offset to the 
address of another variable. 

• Do not rely on pointer or address comparisons between two different 
variables. 

• Do not make assumptions about the alignment of variables, such as
assuming that a short integer is aligned the same as an integer.

Inlining across multiple source files

Inlining substitutes function calls with copies of the function's object
code. Only functions that meet the optimizer's criteria are inlined. This
may result in slightly larger executable files. However, this increase in
size is offset by the elimination of time-consuming procedure calls and
procedure returns. See the "Inlining within a single source file"
section on page 55 for an example of inlining.

Inlining at +O4 is performed across all procedures within the program.

Inlining at +O3 is done within one file.

Inlining is affected by the +O[no]inline[=namelist] and
+Oinline_budget=n command-line options. See "Controlling
optimization," on page 113 for more information on these options.

This chapter discusses parallel optimization features available with the
HP-UX compilers, including those inherent in optimization levels +O3
and +O4. This includes a discussion of the following topics:

• Levels of parallelism 

• Threads 

• Idle thread states

• Parallel optimizations

• Inhibiting parallelization

• Reductions

• Preventing parallelization

• Parallelism in the aC++ compiler

• Cloning across multiple source files

For more information on specific parallel command-line options, as
well as pragmas and directives, see "Controlling optimization," on
page 113.

Levels of parallelism

In the HP compilers, parallelism exists at the loop level, task level, and
region level, as described in Chapter 9, "Parallel programming
techniques". These are briefly described as follows.

• HP compilers automatically exploit loop-level parallelism. This type
of parallelism involves dividing a loop into several smaller iteration
spaces and scheduling these to run simultaneously on the available
processors. For more information, see the "Parallelizing loops" section on
page 178.

Using the +Oparallel option at +O3 and above allows the compiler
to automatically parallelize loops that are profitable to parallelize.

Only loops with iteration counts that can be determined prior to loop
invocation at runtime are candidates for parallelization. Loops with
iteration counts that depend on values or conditions calculated within
the loop cannot be parallelized by any means.

• Specify task-level parallelism using the begin_tasks, next_task,
and end_tasks directives and pragmas, as discussed in the
"Parallelizing tasks" section on page 192.

• Specify parallel regions using the parallel and end_parallel
directives and pragmas, as discussed in the "Parallelizing
regions" section on page 197. These directives and pragmas allow the
compiler to run identified sections of code in parallel.

Loop-level parallelism 

HP compilers locate parallelism at the loop level, generating parallel
code that is automatically run on as many processors as are available at
runtime. Normally, these are all the processors on the same system
where your program is running. You can specify a smaller number of
processors using any of the following:

• loop_parallel(max_threads=m) directive and pragma, available
in Fortran and C

• prefer_parallel(max_threads=m) directive and pragma,
available in Fortran and C

For more information on the loop_parallel and
prefer_parallel directives and pragmas, see Chapter 9, "Parallel
programming techniques".

• mp_number_of_threads environment variable. This variable is
read at runtime by your program. If this variable is set to some
positive integer n, your program executes on n processors. n must be
less than or equal to the number of processors in the system where
the program is executing.

Automatic parallelization 

Automatic parallelization is useful for programs containing loops. You 
can use compiler directives or pragmas to improve on the automatic 
optimizations and to assist the compiler in locating additional 
opportunities for parallelization. 

If you are writing your program entirely under the message-passing
paradigm, you must explicitly handle parallelism as discussed in the
HP MPI User's Guide.

Example Loop-level parallelism

This example begins with the following Fortran code:

      PROGRAM PARAXPL

      DO I = 1, 1024
        A(I) = B(I) + C(I)

      ENDDO

Assuming that the i loop does not contain any parallelization-inhibiting
code, this program can be parallelized to run on eight processors by
running 128 iterations per processor (1024 iterations divided by 8
processors = 128 iterations each). One processor would run the loop for
i = 1 to 128. The next processor would run i = 129 to 256, and so on. The
loop could similarly be parallelized to run on any number of processors,
with each one taking its appropriate share of iterations.

At a certain point, however, adding more processors does not improve
performance. The compiler generates code that runs on as many
processors as are available, but the dynamic selection optimization
(described in the "Dynamic selection" section on page 102)
ensures that parallel code is executed only if it is profitable to do so. If
the number of available processors does not evenly divide the number of
iterations, some processors perform fewer iterations than others.


Threads 

Parallelization divides a program into threads. A thread is a single flow
of control within a process. It can be a unique flow of control that
performs a specific function, or one of several instances of a flow of
control, each of which is operating on a unique data set.

On a V-Class server, parallel shared-memory programs run as a
collection of threads on multiple processors. When a program starts, a
separate execution thread is created on each system processor on which
the program is running. All but one of these threads are then idle. The
nonidle thread is known as thread 1, and this thread runs all of the
serial code in the program.

Spawn thread IDs are assigned only to nonidle threads when they are
spawned. This occurs when thread 1 encounters parallelism and "wakes
up" other idle threads to execute the parallel code. Spawn thread IDs are
consecutive, ranging from 0 to N-1, where N is the number of threads
spawned as a result of the spawn operation. This operation defines the
current spawn context. The spawn context is the loop, task list, or region
that initiates the spawning of the threads. Spawn thread IDs are valid
only within a given spawn context.

This means that the idle threads are not assigned spawn thread IDs at
the time of their creation. When thread 1 encounters a parallel loop,
task, or region, it spawns the other threads, signaling them to begin
execution. The threads then become active, acquire spawn thread IDs,
run until their portion of the parallel code is finished, and go idle once
again, as shown in Figure 13.

Machine loading does not affect the number of threads spawned, but it may
affect the order in which the threads in a given spawn context complete.

Figure 13 One-dimensional parallelism in threads

[Figure: PROGRAM PARAXPL starts on one thread (spawn thread ID 0) while
the other seven threads are idle. At the DO I=1,1024 loop, seven more
threads are spawned with spawn thread IDs 1 through 7. The eight threads
run I=1,128; I=129,256; I=257,384; I=385,512; I=513,640; I=641,768;
I=769,896; and I=897,1024. After ENDDO, the seven spawned threads go
idle again. Numbers shown represent spawn thread IDs.]


Loop transformations 

Figure 13 above shows that various loop transformations can affect the 
manner in which a loop is parallelized. 

To implement this, the compiler transforms the loop in a manner similar
to strip mining. However, unlike in strip mining, the outer loop is
conceptual. Because the strips execute on different processors, there is
no processor to run an outer loop like the one created in traditional strip
mining.

Instead, the loop is transformed. The starting and stopping iteration
values are variables that are determined at runtime based on how many
threads are available and which thread is running the strip in question.

Example Loop transformations 

Consider the previous Fortran example written for an unspecified 
number of iterations: 

      DO I = 1, N
        A(I) = B(I) + C(I)
      ENDDO

The code shown in Figure 14 is a conceptual representation of the
transformation the compiler performs on this example when it is
compiled for parallelization, assuming that n >= NumThreads.

For n < NumThreads, the compiler uses n threads, assuming there is
enough work in the loop to justify the overhead of parallelizing it. If
NumThreads is not an integral divisor of n, some threads perform fewer
iterations than others.

Figure 14 Conceptual strip mine for parallelization

For each available thread do:

      DO I = ThrdID*(N/NumThreads)+1, ThrdID*(N/NumThreads)+N/NumThreads
        A(I) = B(I) + C(I)
      ENDDO


NumThreads is the number of available threads. ThrdID is the ID
number of the thread this particular loop runs on, which is between 0
and NumThreads-1. A unique ThrdID is assigned to each thread, and
the ThrdIDs are consecutive. So, for NumThreads = 8, as in Figure 13,
8 loops would be spawned, with ThrdIDs = 0 through 7. These 8 loops
are illustrated in Figure 15.

Figure 15 Parallelized loop

Thread 0
      DO I = 1, 128
        A(I) = B(I) + C(I)
      ENDDO

Thread 1
      DO I = 129, 256
        A(I) = B(I) + C(I)
      ENDDO

Thread 2
      DO I = 257, 384
        A(I) = B(I) + C(I)
      ENDDO

Thread 3
      DO I = 385, 512
        A(I) = B(I) + C(I)
      ENDDO

Thread 4
      DO I = 513, 640
        A(I) = B(I) + C(I)
      ENDDO

Thread 5
      DO I = 641, 768
        A(I) = B(I) + C(I)
      ENDDO

Thread 6
      DO I = 769, 896
        A(I) = B(I) + C(I)
      ENDDO

Thread 7
      DO I = 897, 1024
        A(I) = B(I) + C(I)
      ENDDO
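
The loop bounds in Figure 14 can be evaluated with a short C sketch.
The function names are hypothetical, the bounds are 1-based and
inclusive as in the Fortran loop, and n is assumed to be a multiple of
num_threads. How leftover iterations are assigned when it is not is
not specified here, so the sketch does not model that case.

```c
/* First iteration run by thread thrd_id, following the lower bound
   ThrdID*(N/NumThreads)+1 from Figure 14. */
int strip_first(int n, int num_threads, int thrd_id)
{
    return thrd_id * (n / num_threads) + 1;
}

/* Last iteration (inclusive) run by thread thrd_id, following the
   upper bound ThrdID*(N/NumThreads)+N/NumThreads from Figure 14. */
int strip_last(int n, int num_threads, int thrd_id)
{
    return thrd_id * (n / num_threads) + n / num_threads;
}
```

For n = 1024 and 8 threads, thread 0 gets iterations 1 through 128 and
thread 7 gets iterations 897 through 1024, matching the eight strips
of the parallelized loop.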


NOTE The strip-based parallelism described here is the default. Stride-based
parallelism is possible through use of the prefer_parallel and
loop_parallel compiler directives and pragmas.

In these examples, the data being manipulated within the loop is disjoint
so that no two threads attempt to write the same data item. If two
parallel threads attempt to update the same storage location, their
actions must be synchronized. This is discussed further in "Parallel
synchronization," on page 243.




Idle thread states 

Idle threads can be suspended or spin-waiting. Suspended threads
release control of the processor, while spin-waiting threads repeatedly
check an encached global semaphore that indicates whether or not they
have code to execute. Spin-waiting prevents any other process from
gaining control of the CPU and can severely degrade multiprocess
performance.

On the other hand, waking a suspended thread takes substantially longer
than activating a spin-waiting thread. By default, idle threads spin-wait
briefly after creation or a join, then suspend themselves if no work is
received.

When threads are suspended, HP-UX may schedule threads of another
process on their processors in order to balance machine load. However,
threads have an affinity for their original processors. HP-UX tries to
schedule unsuspended threads to their original processors in order to
exploit the presence of any data encached during the thread's last
timeslice. This occurs only if the original processor is available.
Otherwise, the thread is assigned to the first processor to become
available.

Determining idle thread states

Use the mp_idle_threads_wait environment variable to determine
how threads wait. The form of the mp_idle_threads_wait
environment variable is shown in Table 11.


Table 11 Form of mp_idle_threads_wait environment variable

Language     Form
Fortran, C   setenv MP_IDLE_THREADS_WAIT=n

where

n   is the integer value, represented in milliseconds, that
    the threads spin-wait. These have values as described
    below:

• For n less than 0, the threads spin-wait.

• For n equal to or greater than 0, the threads spin-wait for n
milliseconds before being suspended.

By default, idle threads spin-wait briefly after creation or a join. They
then suspend themselves if no work is received.

Parallel optimizations

Simple loops can be parallelized without the need for extensive 
transformations. However, most loop transformations do enhance 
optimum parallelization. For instance, loop interchange orders loops so 
that the innermost loop best exploits the processor data cache, and the 
outermost loop is the most efficient loop to parallelize. 

Loop blocking similarly aids parallelization by maximizing cache data
reuse on each of the processors that the loop runs on. It also ensures that
each processor is working on nonoverlapping array data.

Dynamic selection

The compiler has no way of determining how many processors are
available to run compiled code. Therefore, it sometimes generates both
serial and parallel code for loops that are parallelized. Replicating the
loop in this manner is called cloning, and the resulting versions of the
loop are called clones. Cloning is also performed when the loop-iteration
count is unknown at compile-time.

It is not always profitable, however, to run the parallel clone when
multiple processors are available. Some overhead is involved in
executing parallel code. This overhead includes the time it takes to
spawn parallel threads, to privatize any variables used in the loop that
must be privatized, and to join the parallel threads when they complete
their work.

Workload-based dynamic selection 

HP compilers use a powerful form of dynamic selection known as 
workload-based dynamic selection. When a loop’s iteration count is 
available at compile time, workload-based dynamic selection determines 
the profitability of parallelizing the loop. It only writes a parallel version 
to the executable if it is profitable to do so. 

If the parallel version will not be needed, the compiler can omit it from 
the executable to further enhance performance. This eliminates the 
runtime decision as to which version to use. 

The power of dynamic selection becomes more apparent when the loop's 
iteration count is unknown at compile time. In this case, the compiler 
generates code that, at runtime, compares the amount of work performed 
in the loop nest (given the actual iteration counts) to the parallelization 
overhead for the available number of processors. It then runs the parallel 
version of the loop only if it is profitable to do so. 
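
The runtime comparison can be pictured with a minimal C sketch. The cost model here (a fixed overhead plus the work divided evenly among processors) and every name in it are assumptions for illustration; the compiler's actual heuristics are not documented in this guide.

```c
/* Illustrative cost model only; names and units are assumptions, not the
   compiler's actual heuristics. */

/* Estimated cost of the serial clone: iterations times work per iteration. */
double serial_cost(long iterations, double work_per_iteration)
{
    return (double)iterations * work_per_iteration;
}

/* Estimated cost of the parallel clone: the work is divided among the
   processors, and a fixed overhead covers spawn, privatization, and join. */
double parallel_cost(long iterations, double work_per_iteration,
                     int processors, double overhead)
{
    return overhead + serial_cost(iterations, work_per_iteration) / processors;
}

/* Return 1 to run the parallel clone, 0 to run the serial clone. */
int choose_parallel_clone(long iterations, double work_per_iteration,
                          int processors, double overhead)
{
    if (processors < 2)
        return 0;   /* nothing to gain on a single processor */
    return parallel_cost(iterations, work_per_iteration, processors, overhead)
           < serial_cost(iterations, work_per_iteration);
}
```

With an overhead of 5000 work units and four processors, this model keeps a 100-iteration loop serial and runs a 1,000,000-iteration loop in parallel.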

When +Oparallel is specified at +O3, workload-based dynamic 
selection is enabled by default. The compiler only generates a parallel 
version of the loop when +Onodynsel is selected, thereby disabling 
dynamic selection. When dynamic selection is disabled, the compiler 
assumes that it is profitable to parallelize all parallelizable loops and 
generates both serial and parallel clones for them. In this case the 
parallel version is run if there are multiple processors at runtime, 
regardless of the profitability of doing so. 

dynsel, no_dynsel 

The dynsel directive and pragma are used to enable dynamic selection 
for specific loops in programs compiled using the +Onodynsel option, 
or to provide trip count information for specific loops in programs 
compiled with dynamic selection enabled. 

To disable dynamic selection for selected loops, use the no_dynsel 
compiler directive or pragma. This directive or pragma is used to disable 
dynamic selection on specific loops in programs compiled with dynamic 
selection enabled. 

The form of these directives and pragmas is shown in Table 12. 


Table 12    Form of dynsel directive and pragma 

Language    Form 
--------    ----------------------------------------------- 
Fortran     C$DIR DYNSEL [(THREAD_TRIP_COUNT = n)] 
            C$DIR NO_DYNSEL 
C           #pragma _CNX dynsel [(thread_trip_count = n)] 
            #pragma _CNX no_dynsel 

where 

thread_trip_count 
            is an optional attribute used to specify threshold 
            iteration counts. 

            When thread_trip_count = n is specified, the 
            serial version of the loop is run if the iteration count is 
            less than n. Otherwise, the thread-parallel version is 
            run. 

            If a trip count is not specified for a dynsel directive or 
            pragma, the compiler uses a heuristic to estimate the 
            actual execution costs. This estimate is then used to 
            determine if it is profitable to execute the loop in 
            parallel. 
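
As a sketch of how the C form in Table 12 might be attached to a loop, consider the function below. The function name scale and the threshold of 1000 are illustrative assumptions. Compilers that do not recognize _CNX pragmas generally ignore them, so the loop then simply runs serially.

```c
/* Multiply the first n elements of a by s.
   The pragma asks for the serial clone when n is less than 1000 and the
   thread-parallel clone otherwise; the threshold is only an example. */
void scale(double *a, int n, double s)
{
    int i;

#pragma _CNX dynsel(thread_trip_count = 1000)
    for (i = 0; i < n; i++)
        a[i] = a[i] * s;
}
```

Choosing the threshold is a tuning decision: it should sit near the iteration count at which the parallelization overhead is repaid for the loop body in question.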

As with all optimizations that replicate loops, the number of new loops 
created when the compiler performs dynamic selection is limited by 
default to ensure reasonable code sizes. To increase the replication limit 
(and possibly increase your compile time and code size), specify the 
+Onosize +Onolimit compiler options. These are described in 
"Controlling optimization," on page 113. 



Inhibiting parallelization 

Certain constructs, such as loop-carried dependences, inhibit 
parallelization. Other types of constructs, such as procedure calls and I/O 
statements, inhibit parallelism for the same reason they inhibit 
localization. An exception to this is that more categories of loop-carried 
dependences can inhibit parallelization than data localization. This is 
described in the following sections. 

Loop-carried dependences (LCDs) 

The specific loop-carried dependences (LCDs) that inhibit data 
localization represent a very small portion of all loop-carried 
dependences. A much broader set of LCDs inhibits parallelization. 
Examples of various parallel-inhibiting LCDs follow. 

Parallel-inhibiting LCDs 

One type of LCD, a forward LCD, exists when one iteration references a 
variable whose value is assigned on a later iteration. The Fortran loop 
below contains this type of LCD on the array A. 

      DO I = 1, N - 1 
        A(I) = A(I + 1) + B(I) 
      ENDDO 

In this example, the first iteration assigns a value to A(1) and 
references A(2). The second iteration assigns a value to A(2) and 
references A(3). The reference to A(I + 1) depends on the fact that the 
I + 1th iteration, which assigns a new value to A(I + 1), has not yet 
executed. 

Forward LCDs inhibit parallelization because if the loop is broken up to 
run on several processors, when I reaches its terminal value on one 
processor, A(I + 1) has usually already been computed by another 
processor. It is, in fact, the first value computed by another processor. 
Because the calculation depends on A(I + 1) not being computed yet, this 
would produce wrong answers. 
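
The effect can be reproduced without any threads. The C sketch below is an illustration only: it uses zero-based indexing, made-up data, and hypothetical function names, and it imitates two processors by running the second chunk of the loop before the first.

```c
/* Run iterations lo <= i < hi of the loop  a[i] = a[i + 1] + b[i]. */
static void run_chunk(double *a, const double *b, int lo, int hi)
{
    int i;
    for (i = lo; i < hi; i++)
        a[i] = a[i + 1] + b[i];
}

/* Fill serial[] with the result of the original loop order, and split[]
   with the result obtained when the second chunk runs before the first. */
void forward_lcd_demo(double serial[4], double split[4])
{
    const double init[4] = {1.0, 2.0, 3.0, 4.0};
    const double b[3] = {10.0, 20.0, 30.0};
    int i;

    for (i = 0; i < 4; i++)
        serial[i] = split[i] = init[i];

    run_chunk(serial, b, 0, 3);   /* iterations 0, 1, 2 in order */

    run_chunk(split, b, 2, 3);    /* "processor 1": iteration 2 runs first */
    run_chunk(split, b, 0, 2);    /* "processor 0": iterations 0 and 1 */
}
```

In the split run, the last iteration of the first chunk reads an element that the second chunk has already overwritten, so serial[1] and split[1] differ.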

Parallel-inhibiting LCDs 

Another type of LCD exists when one iteration references a variable 
whose value was assigned on an earlier iteration. The Fortran loop below 
contains a backward LCD on the array A. 

      DO I = 2, N 
        A(I) = A(I - 1) + B(I) 
      ENDDO 

Here, each iteration assigns a value to A based on the value assigned to A 
in the previous iteration. If A(I - 1) has not been computed before A(I) 
is assigned, wrong answers result. 

Backward LCDs inhibit parallelism because if the loop is broken up to 
run on several processors, A(I - 1) is not computed for the first 
iteration of the loop on every processor except the processor running the 
chunk of the loop containing the first iteration, I = 2. 

Output LCDs 

An output LCD exists when the same memory location is assigned values 
on two or more iterations. A potential output LCD exists when the 
compiler cannot determine whether an array subscript contains the 
same values between loop iterations. 

The Fortran loop below contains a potential output LCD on the array A: 

      DO I = 1, N 
        A(J(I)) = B(I) 
      ENDDO 

Here, if any referenced elements of J contain the same value, the same 
element of A is assigned several different elements of B. In this case, as 
this loop is written, any A elements that are assigned more than once 
should contain the final assignment at the end of the loop. This cannot be 
guaranteed if the loop is run in parallel. 

Apparent LCDs 

The compiler chooses not to parallelize loops containing apparent LCDs 
rather than risk wrong answers by doing so. 

If you are sure that a loop with an apparent LCD is safe to parallelize, 
you can indicate this to the compiler using the no_loop_dependence 
directive or pragma, which is explained in the section "Loop-carried 
dependences (LCDs)" on page 59. 

The following Fortran example illustrates a no_loop_dependence 
directive being used on the output LCD example presented previously: 

C$DIR NO_LOOP_DEPENDENCE(A) 
      DO I = 1, N 
        A(J(I)) = B(I) 
      ENDDO 

This effectively tells the compiler that no two elements of J are identical, 
so there is no output LCD and the loop is safe to parallelize. If any of the 
J values are identical, wrong answers could result. 
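
Because the directive is a promise that the compiler cannot check, it can be worth verifying the index array in a debug build. The C sketch below is one way to do that. The helper indices_are_unique and the function scatter are hypothetical, and the C pragma spelling is assumed to follow the same #pragma _CNX pattern as the other pragmas in this chapter.

```c
#include <stdlib.h>

/* Return 1 if j[0..n-1] holds no repeated value and every value lies in
   the range 0..m-1; return 0 otherwise (or if memory cannot be obtained). */
int indices_are_unique(const int *j, int n, int m)
{
    unsigned char *seen = calloc((size_t)m, 1);
    int i, unique = 1;

    if (seen == NULL)
        return 0;
    for (i = 0; i < n && unique; i++) {
        if (j[i] < 0 || j[i] >= m || seen[j[i]])
            unique = 0;          /* out of range or already used */
        else
            seen[j[i]] = 1;
    }
    free(seen);
    return unique;
}

/* C counterpart of the Fortran loop above. The pragma asserts that the
   stores to a never collide, which holds only when j has no repeats. */
void scatter(double *a, const int *j, const double *b, int n)
{
    int i;

#pragma _CNX no_loop_dependence(a)
    for (i = 0; i < n; i++)
        a[j[i]] = b[i];
}
```

Calling indices_are_unique before scatter costs one extra pass over the index array, which is usually acceptable in testing and can be compiled out of production builds.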
Reductions 

In many cases, the compiler can recognize and parallelize loops 
containing a special class of dependence known as a reduction. In 
general, a reduction has the form: 

      x = x operator y 

where 

x           is a variable not assigned or used elsewhere in the loop, 

y           is a loop constant expression not involving x, and 

operator    is +, *, .and., .or., or .xor. 

The compiler also recognizes reductions of the form: 

      x = function(x, y) 

where 

x           is a variable not assigned or referenced elsewhere in 
            the loop, 

y           is a loop constant expression not involving x, and 

function    is the intrinsic max function or intrinsic min function. 

Generally, the compiler automatically recognizes reductions in a loop and 
is able to parallelize the loop. If the loop is under the influence of the 
prefer_parallel directive or pragma, the compiler still recognizes 
reductions. 

However, in a loop being manipulated by the loop_parallel directive 
or pragma, reduction analysis is not performed. Consequently, the loop 
may not be correctly parallelized unless the reduction is enforced using 
the reduction directive or pragma. 

The form of this directive and pragma is shown in Table 13. 

Table 13    Form of reduction directive and pragma 

Language    Form 
--------    ----------------------- 
Fortran     C$DIR REDUCTION 
C           #pragma _CNX reduction 
Reduction 
Reductions commonly appear in the form of sum operations, as shown in 
the following Fortran example: 

      DO I = 1, N 
        A(I) = B(I) + C(I) 
        ASUM = ASUM + A(I) 
      ENDDO 

Assuming this loop does not contain any parallelization-inhibiting code, 
the compiler would automatically parallelize it. The code generated to 
accomplish this creates temporary, thread-specific copies of ASUM for 
each thread that runs the loop. When each parallel thread completes its 
portion of the loop, thread 0 for the current spawn context accumulates 
the thread-specific values into the global ASUM. 

The following Fortran example shows the use of the reduction directive 
on a similar loop. loop_parallel is described on page 179. 
loop_private is described on page 220. 

C$DIR LOOP_PARALLEL, LOOP_PRIVATE(FUNCTEMP), REDUCTION(SUM) 
      DO I = 1, N 
        FUNCTEMP = FUNC(X(I)) 
        SUM = SUM + FUNCTEMP 
      ENDDO 
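
The thread-specific copies described above can be imitated in plain C. The sketch below is a serial simulation under stated assumptions: the thread count NTHREADS, the even chunking, and the function name are illustrative and do not describe the code the compiler actually emits.

```c
#define NTHREADS 4   /* illustrative thread count */

/* Sum x[0..n-1] the way a parallelized reduction would: each "thread"
   adds up its own chunk into a private partial sum, and the partial sums
   are combined at the end. The threads are simulated by a serial loop. */
double sum_with_partials(const double *x, int n)
{
    double partial[NTHREADS] = {0.0};
    double total = 0.0;
    int t, i;

    for (t = 0; t < NTHREADS; t++) {
        /* Chunk boundaries for thread t; together they cover 0..n-1. */
        int lo = (int)((long)n * t / NTHREADS);
        int hi = (int)((long)n * (t + 1) / NTHREADS);
        for (i = lo; i < hi; i++)
            partial[t] += x[i];
    }

    /* The combining step that the text assigns to thread 0. */
    for (t = 0; t < NTHREADS; t++)
        total += partial[t];
    return total;
}
```

Because floating-point addition is not associative, the combined result can differ slightly from the serial sum when the data are not exactly representable, which is one reason reductions are treated as a special class of dependence.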
Preventing parallelization 

You can prevent parallelization on a loop-by-loop basis using the 
no_parallel directive or pragma. The form of this directive and 
pragma is shown in Table 14. 

Table 14    Form of no_parallel directive and pragma 

Language    Form 
--------    ------------------------- 
Fortran     C$DIR NO_PARALLEL 
C           #pragma _CNX no_parallel 

Use these directives to prevent parallelization of the loop that 
immediately follows them. Only parallelization is inhibited; all other 
loop optimizations are still applied. 

no_parallel 

The following Fortran example illustrates the use of no_parallel: 

      DO I = 1, 1000 
C$DIR NO_PARALLEL 
        DO J = 1, 1000 
          A(I, J) = B(I, J) 
        ENDDO 
      ENDDO 

In this example, parallelization of the J loop is prevented. The I loop can 
still be parallelized. 

The +Onoautopar compiler option is available to disable automatic 
parallelization while still allowing parallelization of directive-specified 
loops. Refer to "Controlling optimization," on page 113, and "Parallel 
programming techniques," on page 175, for more information on 
+Onoautopar. 
Parallelism in the aC++ compiler 

Parallelism in the aC++ compiler is available through the use of the 
following command-line options or libraries: 

• +O3 +Oparallel or +O4 +Oparallel optimization options: 
  Automatic parallelization is available from the compiler; see the 
  section "Levels of parallelism" on page 94 for more information. 

• HP MPI: HP's implementation of the message-passing interface; see 
  the HP MPI User's Guide for more information. 

• Pthreads (POSIX threads): See the pthread(3t) man page or the 
  manual Programming with Threads on HP-UX for more information. 

None of the pragmas described in this book are currently available in the 
HP aC++ compiler. However, aC++ does support the memory classes 
briefly explained in "Controlling optimization," on page 113, and more 
specifically in "Memory classes," on page 233. These classes are 
implemented through the storage class specifiers node_private and 
thread_private. 
Cloning across multiple source files 

Cloning is the replacement of a call to a routine by a call to a clone of that 
routine. The clone is optimized differently than the original routine. 
Cloning can expose additional opportunities for interprocedural 
optimization. 

Cloning at +O4 is performed across all procedures within the program. 
Cloning at +O3 is done within one file. Cloning is enabled by default. It is 
disabled by specifying the +Onoinline command-line option. 
Controlling optimization 

The HP-UX compiler set includes a group of optimization controls that 
are used to improve code performance. These controls can be invoked 
from either the command line or from within a program using certain 
directives and pragmas. 

This chapter includes a discussion of the following topics: 

• Command-line optimization options 

• Invoking command-line options 

• C aliasing options 

• Optimization directives and pragmas 

Refer to Chapter 3, "Optimization levels," for information on coding 
guidelines that assist the optimizer. See the f90(1), cc(1), and aCC(1) 
man pages for information on compiler options in general. 

NOTE    The HP aC++ compiler does not support the pragmas described in this 
        chapter. 
Command-line optimization options 

This section lists the command-line optimization options available for 
use with the HP C, C++, and Fortran compilers. Table 15 describes the 
options and the optimization levels at which they are used. 

Table 15    Command-line optimization options 

Optimization options                          Valid optimization levels 
------------------------------------------    ------------------------- 
+O[no]aggressive                              +O2, +O3, +O4 
+O[no]all                                     all 
+O[no]autopar                                 +O3, +O4 
  (must be used with the +Oparallel 
  option at +O3 or above) 
+O[no]conservative                            +O2, +O3, +O4 
+O[no]dataprefetch                            +O2, +O3, +O4 
+O[no]dynsel                                  +O3, +O4 
  (must be used with the +Oparallel 
  option at +O3 or above) 
+O[no]entrysched                              +O1, +O2, +O3, +O4 
+O[no]fail_safe                               +O1, +O2, +O3, +O4 
+O[no]fastaccess                              all 
+O[no]fltacc                                  +O2, +O3, +O4 
+O[no]global_ptrs_unique[=namelist]           +O2, +O3, +O4 
  (C only) 
+O[no]info                                    all 
+O[no]initcheck                               +O2, +O3, +O4 
+O[no]inline[=namelist]                       +O3, +O4 
+Oinline_budget=n                             +O3, +O4 
+O[no]libcalls                                all 
+O[no]limit                                   +O2, +O3, +O4 
+O[no]loop_block                              +O3, +O4 
+O[no]loop_transform                          +O3, +O4 
+O[no]loop_unroll[=unroll_factor]             +O2, +O3, +O4 
+O[no]loop_unroll_jam                         +O3, +O4 
+O[no]moveflops                               +O2, +O3, +O4 
+O[no]multiprocessor                          +O2, +O3, +O4 
+O[no]parallel                                +O3, +O4 
+O[no]parmsoverlap                            +O2, +O3, +O4 
+O[no]pipeline                                +O2, +O3, +O4 
+O[no]procelim                                all 
+O[no]ptrs_ansi                               +O2, +O3, +O4 
+O[no]ptrs_strongly_typed                     +O2, +O3, +O4 
+O[no]ptrs_to_globals[=namelist]              +O2, +O3, +O4 
  (C only) 
+O[no]regreassoc                              +O2, +O3, +O4 
+O[no]report[=report_type]                    +O3, +O4 
+O[no]sharedgra                               +O2, +O3, +O4 
+O[no]signedpointers                          +O2, +O3, +O4 
  (C/C++ only) 
+O[no]size                                    +O2, +O3, +O4 
+O[no]static_prediction                       all 
+O[no]vectorize                               +O3, +O4 
+O[no]volatile                                +O1, +O2, +O3, +O4 
+O[no]whole_program_mode                      +O4 
Invoking command-line options 

At each optimization level, you can turn specific optimizations on or off 
using the +O[no]optimization option. The optimization parameter is the 
name of a specific optimization. The optional prefix [no] disables the 
specified optimization. 

The following sections describe the optimizations that are turned on or 
off, their defaults, and the optimization levels at which they may be used. 
In syntax descriptions, namelist represents a comma-separated list of 
names. 

+O[no]aggressive 
Optimization level: +O2, +O3, +O4 
Default: +Onoaggressive 

+O[no]aggressive enables or disables optimizations that can result in 
significant performance improvement, and can change a program's 
behavior. This includes the optimizations invoked by the following 
advanced options (these are discussed separately in this chapter): 

• +Osignedpointers (C and C++) 

• +Oentrysched 

• +Onofltacc 

• +Olibcalls 

• +Onoinitcheck 

• +Ovectorize 
+O[no]all 
Optimization level: all 
Default: +Onoall 

Equivalent option: +Oall is equivalent to specifying +O4 
+Oaggressive +Onolimit 

+Oall performs maximum optimization, including aggressive 
optimizations and optimizations that can significantly increase compile 
time and memory usage. 

+O[no]autopar 
Optimization level: +O3, +O4 (+Oparallel must also be specified) 
Default: +Oautopar 

When used with the +Oparallel option, +Oautopar causes the compiler to 
automatically parallelize loops that are safe to parallelize. A loop is 
considered safe to parallelize if its iteration count can be determined at 
runtime before loop invocation. It must also contain no loop-carried 
dependences, procedure calls, or I/O operations. 

A loop-carried dependence exists when one iteration of a loop assigns a 
value to an address that is referenced or assigned on another iteration. 

When used with +Oparallel, the +Onoautopar option causes the 
compiler to parallelize only those loops marked by the loop_parallel 
or prefer_parallel directives or pragmas. Because the compiler does 
not automatically find parallel tasks or regions, user-specified task and 
region parallelization is not affected by this option. 

C pragmas and Fortran directives are used to improve the effect of 
automatic optimizations and to assist the compiler in locating additional 
opportunities for parallelization. See the section "Optimization 
directives and pragmas" on page 146 for more information. 
+O[no]conservative 
Optimization level: +O2, +O3, +O4 
Default: +Onoconservative 

Equivalent option: +Oconservative is equivalent to 
+Onoaggressive 

+O[no]conservative causes the optimizer to make or not make 
conservative assumptions about the code when optimizing. 
+Oconservative is useful in assuming a particular program's coding 
style, such as whether it is standard-compliant. Specifying 
+Onoconservative disables any optimizations that assume 
standard-compliant code. 

+O[no]dataprefetch 
Optimization level: +O2, +O3, +O4 
Default: +Onodataprefetch 

When +Odataprefetch is used, the optimizer inserts instructions 
within innermost loops to explicitly prefetch data from memory into the 
data cache. For cache lines containing data to be written, 
+Odataprefetch prefetches the cache lines so that they are valid for 
both read and write access. Data prefetch instructions are inserted only 
for data referenced within innermost loops using simple loop-varying 
addresses in a simple arithmetic progression. It is only available for 
PA-RISC 2.0 targets. 

The math library libm contains special prefetching versions of vector 
routines. If you have a PA-RISC 2.0 application containing operations on 
arrays larger than one megabyte in size, using +Ovectorize in 
conjunction with +Odataprefetch may substantially improve 
performance. 

You can also use the +Odataprefetch option for applications that have 
high data cache miss overhead. 
+O[no]dynsel 
Optimization level: +O3, +O4 (+Oparallel must also be specified) 
Default: +Odynsel 

When specified with +Oparallel, +Odynsel enables workload-based 
dynamic selection. For parallelizable loops whose iteration counts are 
known at compile time, +Odynsel causes the compiler to generate either 
a parallel or a serial version of the loop, depending on which is more 
profitable. 

This optimization also causes the compiler to generate both parallel and 
serial versions of parallelizable loops whose iteration counts are 
unknown at compile time. At runtime, the loop's workload is compared to 
parallelization overhead, and the parallel version is run only if it is 
profitable to do so. 

The +Onodynsel option disables dynamic selection and tells the 
compiler that it is profitable to parallelize all parallelizable loops. The 
dynsel directive and pragma are used to enable dynamic selection for 
specific loops when +Onodynsel is in effect. See the section "Dynamic 
selection" on page 102 for additional information. 

+O[no]entrysched 
Optimization level: +O1, +O2, +O3, +O4 
Default: +Onoentrysched 

+Oentrysched optimizes instruction scheduling on a procedure's entry 
and exit sequences by unwinding in the entry and exit regions. 
Consequently, this option can be used to increase the speed of an 
application. 

+O[no]entrysched can also change the behavior of programs that 
perform exception handling or that handle asynchronous interrupts. 
The behavior of setjmp() and longjmp() is not affected. 

+O[no]fail_safe 
Optimization level: +O1, +O2, +O3, +O4 
Default: +Ofail_safe 

+Ofail_safe allows your compilations to continue when internal 
optimization errors are detected. When an error is encountered, this 
option issues a warning message and restarts the compilation at +O0. 
The +Ofail_safe option is disabled when you specify +Oparallel with 
+O3 or +O4 to compile with parallelization. 

Using +Onofail_safe aborts your compilation when internal 
optimization errors are detected. 

+O[no]fastaccess 
Optimization level: +O0, +O1, +O2, +O3, +O4 
Default: +Onofastaccess at +O0, +O1, +O2, and +O3; 
         +Ofastaccess at +O4 

+Ofastaccess performs optimization for fast access to global data 
items. Use +Ofastaccess to improve execution speed at the expense of 
longer compile times. 

+O[no]fltacc 
Optimization level: +O2, +O3, +O4 
Default: none (See Table 16.) 

+O[no]fltacc enables or disables optimizations that cause imprecise 
floating-point results. 

+Ofltacc disables optimizations that cause imprecise floating-point 
results. Specifying +Ofltacc disables the generation of Fused 
Multiply-Add (FMA) instructions, as well as other floating-point 
optimizations. Use +Ofltacc if it is important that the compiler 
evaluates floating-point expressions according to the order specified by 
the language standard. 

+Onofltacc improves execution speed at the expense of floating-point 
precision. The +Onofltacc option allows the compiler to perform 
floating-point optimizations that are algebraically correct, but may 
result in numerical differences. These differences are generally 
insignificant. The +Onofltacc option also enables the optimizer to 
generate FMA instructions. 

If you optimize code at +O2 or higher, and do not specify +Onofltacc or 
+Ofltacc, the optimizer uses FMA instructions. However, it does not 
perform floating-point optimizations that involve expression reordering. 
FMA is implemented by the PA-8x00 instructions fmpyfadd and 
fmpynfadd and improves performance. Occasionally, these instructions 
may produce results that may differ in accuracy from results produced by 
code without FMA. In general, the differences are slight. 

Table 16 presents a summary of the preceding information. 

Table 16    +O[no]fltacc and floating-point optimizations 

Option specified (a)      FMA optimizations    Other floating-point 
                                               optimizations 
----------------------    -----------------    -------------------- 
+Ofltacc                  Disabled             Disabled 
+Onofltacc                Enabled              Enabled 
neither option            Enabled              Disabled 
is specified 

a. +O[no]fltacc is only available at +O2 and above. 
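
To see why expression reordering is called algebraically correct yet numerically different, consider the following C sketch. The function names and the values suggested below are illustrative assumptions; the two functions perform by hand the kind of reassociation that an optimizer may be permitted to apply when +Onofltacc is in effect.

```c
/* Evaluate (a + b) + c, the order written in the source.
   The volatile temporary keeps a compiler from rearranging the sum. */
double sum_as_written(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* Evaluate a + (b + c), an algebraically equivalent reordering. */
double sum_reassociated(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With a = 1e20, b = -1e20, and c = 1.0, the first function returns 1.0 and the second returns 0.0, because 1.0 is too small to survive being added to -1e20 in double precision.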


+O[no]global_ptrs_unique[=namelist] 
Optimization level: +O2, +O3, +O4 
Default: +Onoglobal_ptrs_unique 

This option is not available in Fortran or C++. 

Using this C compiler option identifies unique global pointers so that the 
optimizer can generate more efficient code in the presence of unique 
pointers, such as using copy propagation and common subexpression 
elimination. A global pointer is unique if it does not alias with any 
variable in the entire program. 

This option supports a comma-separated list of unique global pointer 
variable names, represented by namelist in 
+O[no]global_ptrs_unique[=namelist]. If namelist is not specified, 
using +O[no]global_ptrs_unique informs the compiler that all [no] 
global pointers are unique. 



The example below states that no global pointers are unique, except a 
and b: 

      +Oglobal_ptrs_unique=a,b 

The next example says that all global pointers are unique except a and b: 

      +Onoglobal_ptrs_unique=a,b 
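
The following C sketch illustrates what uniqueness means for the optimizer. All names in it (p, q, u, and the three functions) are hypothetical. Under these assumptions, only u would be a candidate for a namelist such as +Oglobal_ptrs_unique=u, because p and q refer to the same storage.

```c
#include <stdlib.h>

static int shared_cell;

int *p = &shared_cell;   /* p and q alias each other: neither is unique */
int *q = &shared_cell;
int *u;                  /* after init_unique(), the only name for its cell */

/* Give u its own heap cell; returns 1 on success, 0 on failure. */
int init_unique(void)
{
    u = malloc(sizeof *u);
    return u != NULL;
}

/* The store through q changes *p, so *p must be reloaded from memory. */
int aliased_store_then_load(void)
{
    *p = 1;
    *q = 2;
    return *p;
}

/* The store through p cannot change *u, so the value 1 could be
   propagated to the return without a reload. */
int unique_store_then_load(void)
{
    *u = 1;
    *p = 2;
    return *u;
}
```

Declaring a pointer unique when it is in fact aliased would allow the optimizer to return a stale value in a function like aliased_store_then_load, so the namelist should contain only pointers known to satisfy the definition above.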


+O[no]info 
Optimization level: +O0, +O1, +O2, +O3, +O4 
Default: +Onoinfo 

+Oinfo displays informational messages about the optimization process. 
This option is used at all optimization levels, but is most useful at +O3 
and +O4. For more information about this option, see Chapter 8, 
"Optimization Report" on page 113. 

+O[no]initcheck 
Optimization level: +O2, +O3, +O4 
Default: unspecified 

+O[no]initcheck performs an initialization check for the optimizer. 
The optimizer has three possible states that check for initialization: on, 
off, or unspecified. 

• When on (+Oinitcheck), the optimizer initializes to zero any local, 
  scalar, and nonstatic variables that are uninitialized with respect to 
  at least one path leading to a use of the variable. 

• When off (+Onoinitcheck), the optimizer issues warning messages 
  when it discovers definitely uninitialized variables, but does not 
  initialize them. 

• When unspecified, the optimizer initializes to zero any local, scalar, 
  nonstatic variables that are definitely uninitialized with respect to all 
  paths leading to a use of the variable. 
+O[no]inline[=namelist] 
Optimization level: +O3, +O4 
Default: +Oinline 

When +Oinline is specified without a name list, any function can be 
inlined. For successful inlining, follow the prototype definitions for 
function calls in the appropriate header files. 

When specified with a name list, the named functions are important 
candidates for inlining. For example, the following statement indicates 
that inlining be strongly considered for foo and bar: 

      +Oinline=foo,bar +Onoinline 

All other routines are not considered for inlining because +Onoinline is 
given. 

NOTE    The Fortran and aC++ compilers accept only +O[no]inline. No 
        namelist values are accepted. 

Use the +Onoinline[=namelist] option to exercise precise control 
over which subprograms are inlined. Use of this option is guided by 
knowledge of the frequency with which certain routines are called and 
may be warranted by code size concerns. 

When this option is disabled with a name list, the compiler does not 
consider the specified routines as candidates for inlining. For example, 
the following statement indicates that inlining should not be considered 
for baz and x: 

      +Onoinline=baz,x 

All other routines are considered for inlining because +Oinline is the 
default. 
+Oinline_budget=n 
Optimization level: +O3, +O4 
Default: +Oinline_budget=100 

In +Oinline_budget=n, n is an integer in the range 1 to 1000000 that 
specifies the level of aggressiveness, as follows: 

n=100    Default level of inlining 

n>100    More aggressive inlining 
         The optimizer is less restricted by compilation time and 
         code size when searching for eligible routines to inline 

n=1      Only inline if it reduces code size 

The +Onolimit and +Osize options also affect inlining. Specifying the 
+Onolimit option implies specifying +Oinline_budget=200. The 
+Osize option implies +Oinline_budget=1. However, 
+Oinline_budget takes precedence over both of these options. This 
means that you can override the effects on inlining of the +Onolimit 
and +Osize options by specifying the +Oinline_budget option on the 
same command line. 

+O[no]libcalls 
Optimization level: +O0, +O1, +O2, +O3, +O4 
Default: +Onolibcalls at +O0 and +O1; 
         +Olibcalls at +O2, +O3, and +O4 

+Olibcalls increases the runtime performance of code that calls 
standard library routines in simple contexts. The +Olibcalls option 
expands the following library calls inline: 

• strcpy() 

• sqrt() 

• fabs() 

• alloca() 

Inlining takes place only if the function call follows the prototype 
definition in the appropriate header file. A single call to printf() may 
be replaced by a series of calls to putchar(). Calls to sprintf() and 
strlen() may be optimized more effectively, including elimination of 
some calls producing unused results. Calls to setjmp() and longjmp() 
may be replaced by their equivalents _setjmp() and _longjmp(), 
which do not manipulate the process's signal mask. 

Using the +Olibcalls option invokes millicode versions of frequently 
called math functions. Currently, there are millicode versions for the 
following functions: 

      acos    asin    atan    atan2 
      cos     exp     log     log10 
      pow     sin     tan 

See the HP-UX Floating-Point Guide for the most up-to-date listing of 
the math library functions. 

+Olibcalls also improves the performance of selected library routines 
(when you are not performing error checking for these routines). The 
calling code must not expect to access errno after the function's return. 

Using +Olibcalls with +Ofltacc gives different floating-point 
calculation results than those given using +Olibcalls without 
+Ofltacc. 


+O[no]limit 
Optimization level: +O2, +O3, +O4 
Default: +Olimit 

The +Olimit option suppresses optimizations that significantly increase 
compile time or that can consume a considerable amount of memory. 

The +Onolimit option allows optimizations to be performed, regardless 
of their effects on compile time and memory usage. Specifying the 
+Onolimit option implies specifying +Oinline_budget=200. See the 
section "+Oinline_budget=n" on page 125 for more information. 
+O[no]loop_block 
Optimization level: +O3, +O4 
Default: +Onoloop_block 

+O[no]loop_block enables or disables blocking of eligible loops for 
improved cache performance. The +Onoloop_block option disables both 
automatic and directive-specified loop blocking. For more information on 
loop blocking, see the section "Loop blocking" on page 70. 

+O[no]loop_transform 
Optimization level: +O3, +O4 
Default: +Oloop_transform 

+O[no]loop_transform enables or disables transformation of eligible 
loops for improved cache performance. The most important 
transformation is the interchange of nested loops to make the inner loop 
unit stride, resulting in fewer cache misses. 
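
Loop interchange can be shown by hand in C. The sketch below is illustrative only: the array dimensions and function names are assumptions, and because C stores arrays in row-major order the unit-stride index is the last one, the opposite of Fortran.

```c
#define ROWS 3
#define COLS 4

/* Before interchange: the inner loop walks down a column, so successive
   accesses are COLS elements apart in memory. */
double sum_inner_strided(double m[ROWS][COLS])
{
    double s = 0.0;
    int i, j;

    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}

/* After interchange: the inner loop walks along a row, so successive
   accesses touch adjacent elements (unit stride). */
double sum_inner_unit_stride(double m[ROWS][COLS])
{
    double s = 0.0;
    int i, j;

    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}
```

Both functions visit the same elements, so the interchange is legal here; the benefit appears only for arrays large enough that a row or column no longer fits in the data cache.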

The other transformations affected by +O[no]loop_transform are loop 
distribution, loop blocking, loop fusion, loop unroll, and loop unroll and 
jam. See "Optimization levels," on page 25 for information on loop 
transformations. 

If you experience any problem while using +Oparallel, 
+Onoloop_transform may be a helpful option. 


+O[no]loop_unroll[=unroll factor] 

Optimization level: +O2, +O3, +O4 
Default: +Oloop_unroll=4 

+Oloop_unroll enables loop unrolling. When you use +Oloop_unroll, 
you can also suggest the unroll factor to control the code expansion. The 
default unroll factor is four, meaning that the loop body is replicated four 
times. By experimenting with different factors, you may improve the 
performance of your program. In some cases, the compiler uses its own 
unroll factor. 

The +Onoloop_unroll option disables partial and complete unrolling. 
Loop unrolling improves efficiency by eliminating loop overhead, and can 
create opportunities for other optimizations, such as improved register 
use and more efficient scheduling. See the section "Loop unrolling" 
on page 45 for more information on unrolling. 
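
To make the idea of an unroll factor concrete, the sketch below unrolls a simple scaling loop by four by hand. The function names are hypothetical, and the code only approximates what the compiler generates; the actual generated code depends on the compiler's own analysis.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one multiply and one loop test per element. */
void scale_simple(double *x, size_t n, double k)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}

/* Hand-unrolled by four: the loop body is replicated four times, so the
   loop test and increment run once per four elements. */
void scale_unrolled4(double *x, size_t n, double k)
{
    size_t i = 0;
    assert(x != NULL || n == 0);
    for (; i + 3 < n; i += 4) {
        x[i]     *= k;
        x[i + 1] *= k;
        x[i + 2] *= k;
        x[i + 3] *= k;
    }
    /* Remainder loop for trip counts that are not a multiple of four. */
    for (; i < n; i++)
        x[i] *= k;
}
```

The remainder loop is what keeps the unrolled version correct for any trip count; it is also part of the code expansion that a larger unroll factor brings.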

+O[no]loop_unroll_jam 
Optimization level: +O3, +O4 
Default: +Onoloop_unroll_jam 

The +O[no]loop_unroll_jam option enables or disables loop unrolling 
and jamming. The +Onoloop_unroll_jam option (the default) disables 
both automatic and directive-specified unroll and jam. Loop unrolling 
and jamming increases register exploitation. For more information on 
the unroll and jam optimization, see the section "Loop unroll and jam" 
on page 84. 
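
The following sketch shows, under simplifying assumptions, what unroll and jam does to a matrix-vector product: the outer loop is unrolled by two and the two resulting inner loops are fused ("jammed") into one. The names and the dimension N are hypothetical, and N is assumed to be even so that no remainder loop is needed.

```c
#include <assert.h>
#include <stddef.h>

#define N 4  /* hypothetical dimension; assumed even for this sketch */

/* Original nest: each outer iteration reloads every x[j]. */
void matvec_simple(double y[N], double a[N][N], double x[N])
{
    assert(y != NULL && a != NULL && x != NULL);
    for (size_t i = 0; i < N; i++) {
        y[i] = 0.0;
        for (size_t j = 0; j < N; j++)
            y[i] += a[i][j] * x[j];
    }
}

/* Outer loop unrolled by two, inner loops jammed together: each x[j]
   is loaded once and used for two rows, which improves register reuse. */
void matvec_unroll_jam(double y[N], double a[N][N], double x[N])
{
    assert(y != NULL && a != NULL && x != NULL);
    for (size_t i = 0; i < N; i += 2) {
        y[i] = 0.0;
        y[i + 1] = 0.0;
        for (size_t j = 0; j < N; j++) {
            y[i]     += a[i][j]     * x[j];
            y[i + 1] += a[i + 1][j] * x[j];
        }
    }
}
```

Each element of y is still accumulated in the same order over j, so both versions produce identical results here.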

+O[no]moveflops 
Optimization level: +O2, +O3, +O4 
Default: +Omoveflops 

+O[no]moveflops allows or disallows moving conditional floating-point 
instructions out of loops. The behavior of floating-point exception 
handling may be altered by this option. 

Use +Onomoveflops if floating-point traps are enabled and you do not 
want the behavior of floating-point exceptions to be altered by the 
relocation of floating-point instructions. 
+O[no]multiprocessor 
Optimization level: +O2, +O3, +O4 
Default: +Onomultiprocessor 

Specifying the +Omultiprocessor option at +O2 and above tells the 
compiler to appropriately optimize several different processes on 
multiprocessor machines. The optimizations are those appropriate for 
executables and shared libraries. 

Enabling this option incorrectly (such as on a uniprocessor machine) may 
cause performance problems. 

Specifying +Onomultiprocessor (the default) disables the optimization 
of more than one process running on multiprocessor machines. 

+O[no]parallel 
Optimization level: +O3, +O4 
Default: +Onoparallel 

The +Onoparallel option is the default for all optimization levels. This 
option disables automatic and directive-specified parallelization. 

If you compile one or more files in an application using +Oparallel, 
then the application must be linked (using the compiler driver) with the 
+Oparallel option to link in the proper start-up files and runtime 
support. 

The +Oparallel option causes the compiler to: 

• Recognize the directives and pragmas that involve parallelism, such 
as begin_tasks, loop_parallel, and prefer_parallel 

• Look for opportunities for parallel execution in loops 

The following methods are used to specify the number of processors used 
in executing your parallel programs: 

• loop_parallel(max_threads=m) directive and pragma 

• prefer_parallel(max_threads=m) directive and pragma 

For a description of these directives and pragmas, see "Parallel 
programming techniques," on page 175 and "Parallel 
synchronization," on page 243. These pragmas are not available in 
the HP aC++ compiler. 

• MP_NUMBER_OF_THREADS environment variable, which is read at 
runtime by your program. If this variable is set to some positive 
integer n, your program executes on n processors. The value of n must 
be less than or equal to the number of processors in the system where 
the program is executing. 

The +Oparallel option is valid only at optimization level +O3 and 
above. For information on parallelization, see the section "Levels of 
parallelism" on page 94. 

Using the +Oparallel option disables +Ofail_safe, which is enabled 
by default. See the section "+O[no]fail_safe" on page 121 for more 
information. 

+O[no]parmsoverlap 
Optimization level: +O2, +O3, +O4 
Default (Fortran): +Onoparmsoverlap 
Default (C/C++): +Oparmsoverlap 

+Oparmsoverlap causes the optimizer to assume that the actual 
arguments of function calls overlap in memory. 
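
The sketch below suggests why this assumption matters. The function add_twice is a hypothetical example, not taken from the compiler documentation. When its two arguments may overlap, the value read through src has to be reloaded after each store through dst; if the optimizer may assume the arguments do not overlap, it is free to keep that value in a register.

```c
#include <assert.h>
#include <stddef.h>

/* Adds *src to *dst twice. If dst and src refer to the same object,
   the second read of *src sees the result of the first store. */
void add_twice(int *dst, const int *src)
{
    assert(dst != NULL && src != NULL);
    *dst += *src;
    *dst += *src;
}
```

With distinct arguments (a = 1, b = 5), add_twice(&a, &b) leaves a equal to 11. With overlapping arguments (c = 3), add_twice(&c, &c) leaves c equal to 12 under standard C semantics. Code that relies on the second behavior should not be compiled under an assumption that arguments never overlap.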

+O[no]pipeline 
Optimization level: +O2, +O3, +O4 
Default: +Opipeline 

+O[no]pipeline enables or disables software pipelining. If program 
size is more important than execution speed, use +Onopipeline. 

Software pipelining is particularly useful for loops containing arithmetic 
operations on real or real*8 variables in Fortran or on float or 
double variables in C and C++. 

+O[no]procelim 

Optimization level: +O0, +O1, +O2, +O3, +O4 

Default: +Onoprocelim at +O0, +O1, +O2, +O3; 
+Oprocelim at +O4 

When +Oprocelim is specified, procedures not referenced by the 
application are eliminated from the output executable file. The 
+Oprocelim option reduces the size of the executable file, especially 
when optimizing at +O3 and +O4, at which inlining may have removed all 
of the calls to some routines. 

When +Onoprocelim is specified, procedures not referenced by the 
application are not eliminated from the output executable file. 

If the +Oall option is enabled, the +Oprocelim option is enabled. 

+O[no]ptrs_ansi 
Optimization level: +O2, +O3, +O4 
Default: +Onoptrs_ansi 

The +Optrs_ansi option makes the following two assumptions, which 
the more aggressive +Optrs_strongly_typed does not: 

• int *p is assumed to point to an int field of a struct or union. 

• char * is assumed to point to any type of object. 

This option is not available in C++. 

When both +Optrs_ansi and +Optrs_strongly_typed are specified, 
+Optrs_ansi takes precedence. 
+O[no]ptrs_strongly_typed 
Optimization level: +O2, +O3, +O4 
Default: +Onoptrs_strongly_typed 

Use the C compiler option +Optrs_strongly_typed when pointers are 
type-safe. The optimizer can use this information to generate more 
efficient code. 

This option is not available in C++. 

Type-safe (strongly typed) pointers are pointers to a specific type that 
point only to objects of that type. For example, a pointer declared as a 
pointer to an int is considered type-safe if that pointer points to an 
object of type int only. 

Based on the type-safe concept, a set of groups is built based on object 
types. A given group includes all the objects of the same type. 

In type-inferred aliasing, any pointer of a type in a given group (of 
objects of the same type) can only point to an object from the same 
group. It cannot point to a typed object from any other group. 

Type casting to a different type violates type-inferred aliasing rules. 
Dynamic casting is, however, allowed, as shown in Example 41. 

Data type interaction 

The optimizer generally spills all global data from registers to memory 
before any modification to global variables or any loads through pointers. 
However, the optimizer can generate more efficient code if it knows how 
various data types interact. 
Example 

Consider the following example (line numbers are provided for 
reference): 

 1  int *p; 
 2  float *q; 
 3  int a,b,c; 
 4  float d,e,f; 
 5  foo() 
 6  { 
 7     for (i=1;i<10;i++) { 
 8        d=e; 
 9        *p=... 
10        e=d+f; 
11        f=*q; 
12     } 
13  } 



With +Optrs_strongly_typed turned on, the pointers p and q are 
assumed to be disjoint because the types they point to are different. 
Without type-inferred aliasing, the store to *p is assumed to invalidate 
all the definitions, so the uses of d and f on line 10 must be loaded from 
memory. With type-inferred aliasing, the optimizer can propagate the 
copy of d and f, thus avoiding two loads and two stores. 

This option is used for any application involving the use of pointers, 
where those pointers are type-safe. To specify when a subset of types are 
type-safe, use the ptrs_strongly_typed pragma. The compiler issues 
warnings for any incompatible pointer assignments that may violate the 
type-inferred aliasing rules discussed in the section "C aliasing options" 
on page 143. 

Unsafe type cast 

Any typecast to a different type violates type-inferred aliasing rules. Do 
not use +Optrs_strongly_typed with code that has these "unsafe" 
typecasts. Use the no_ptrs_strongly_typed pragma to prevent the 
application of type-inferred aliasing to the unsafe typecasts. 

struct foo { 
    int a; 
    int b; 
} *p; 

struct bar { 
    float a; 
    int b; 
    float c; 
} *q; 

p = (struct foo *) q;   /* Incompatible pointer assignment 
                           through type cast */ 
Generally applying type aliasing 

Dynamic casting is allowed with +Optrs_strongly_typed or 
+Optrs_ansi. A pointer dereference is called a dynamic cast if a cast is 
applied on the pointer to a different type. 

In the example below, type-inferred aliasing is generally applied on p, 
not just to the particular dereference. Type-aliasing is applied to any 
other dereferences of p. 

struct s { 
    short int a; 
    short int b; 
    int c; 
} *p; 

*(int *)p = 0; 

For more information about type aliasing, see the section "C aliasing 
options" on page 143. 

+O[no]ptrs_to_globals[=namelist] 

Optimization level: +O2, +O3, +O4 
Default: +Optrs_to_globals 

By default, global variables are conservatively assumed to be modified 
anywhere in the program. Use the C compiler option 
+Onoptrs_to_globals to specify which global variables are not 
modified through pointers. This allows the optimizer to make the 
program run more efficiently by incorporating copy propagation and 
common subexpression elimination. 

This option is not available in C++. 

This option is used to specify all global variables that are not modified 
using pointers, or to specify a comma-separated list of global variables 
that are not modified using pointers. 

The on state for this option disables some optimizations, such as 
aggressive optimizations on the program's global symbols. 

For example, use the command-line option 
+Onoptrs_to_globals=a,b,c to specify that global variables a, b, and c 
are not accessible through pointers. The result (shown below) is that no 
pointer can access these global variables. The optimizer performs copy 
propagation and constant folding because storing to *p does not modify a 
or b. 

int a, b, c; 
float *p; 

foo() 
{ 
    a = 10; 
    b = 20; 
    *p = 1.0; 
    c = a + b; 
} 

If all global variables are unique, use the +Onoptrs_to_globals option 
without listing the global variables (that is, without using namelist). 

In the example below, the address of b is taken. This means b is accessed 
indirectly through the pointer. You can still use +Onoptrs_to_globals 
as: 

+Onoptrs_to_globals +Optrs_to_globals=b 

int b,c; 
int *p; 

p=&b; 

foo() 

For more information about type aliasing, see the section "C aliasing 
options" on page 143. 

+O[no]regreassoc 
Optimization level: +O2, +O3, +O4 
Default: +Oregreassoc 

+O[no]regreassoc enables or disables register reassociation. This is a 
technique for folding and eliminating integer arithmetic operations 
within loops, especially those used for array address computations. 

This optimization provides a code-improving transformation 
supplementing loop-invariant code motion and strength reduction. 
Additionally, when performed in conjunction with software pipelining, 
register reassociation can also yield significant performance 
improvement. 
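
The following hand-written sketch approximates the kind of integer arithmetic that such folding removes from an array address computation. It is not the compiler's actual transformation; the function names are hypothetical, and the second version simply replaces the per-iteration multiply with a running offset whose loop-invariant part (the column index) is folded into the starting value.

```c
#include <assert.h>
#include <stddef.h>

/* Sums one column of a rows-by-cols matrix stored in row-major order.
   The subscript costs a multiply and an add on every iteration. */
long sum_indexed(const int *a, size_t rows, size_t cols, size_t col)
{
    long s = 0;
    assert(a != NULL && col < cols);
    for (size_t i = 0; i < rows; i++)
        s += a[i * cols + col];
    return s;
}

/* Same sum with the address arithmetic simplified by hand: the invariant
   column offset is folded into the initial offset, and the multiply is
   replaced by one addition per iteration. */
long sum_running_offset(const int *a, size_t rows, size_t cols, size_t col)
{
    long s = 0;
    size_t off = col;
    assert(a != NULL && col < cols);
    for (size_t i = 0; i < rows; i++) {
        s += a[off];
        off += cols;
    }
    return s;
}
```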
+O[no]report[=report_type] 

Optimization level: +O3, +O4 
Default: +Onoreport 

+Oreport[=report_type] specifies the contents of the Optimization 
Report. Values of report_type and the Optimization Reports they produce 
are shown in Table 17. 

Table 17    Optimization Report contents 

report_type value                  Report contents 

all                                Loop Report and Privatization Table 
loop                               Loop Report 
private                            Loop Report and Privatization Table 
report_type not given (default)    Loop Report 



The Loop Report gives information on optimizations performed on loops 
and calls. Using +Oreport (without =report_type) also produces the 
Loop Report. 


The Privatization Table provides information on loop variables that are 
privatized by the compiler. 

+Oreport[=report_type] is active only at +O3 and above. 

The +Onoreport option does not accept any of the report_type values. 
For more information about the Optimization Report, see "Optimization 
Report," on page 151. 

+Oinfo also displays information on the various optimizations being 
performed by the compilers. +Oinfo is used at any optimization level, 
but is most useful at +O3 and above. The default at all optimization 
levels is +Onoinfo. 

+O[no]sharedgra 
Optimization level: +O2, +O3, +O4 
Default: +Osharedgra 

The +Onosharedgra option disables global register allocation for 
shared-memory variables that are visible to multiple threads. This 
option may help if a variable shared among parallel threads is causing 
wrong answers. See the section "Global register allocation (GRA)" 
on page 43 for more information. 

Global register allocation (+Osharedgra) is enabled by default at 
optimization level +O2 and higher. 

+O[no]signedpointers 
Optimization level: +O2, +O3, +O4 
Default: +Onosignedpointers 

This option is not available in the HP Fortran compiler. 

The C and C++ option +O[no]signedpointers requests that the 
compiler perform or not perform optimizations related to treating 
pointers as signed quantities. This helps improve application runtime 
speed. Applications that allocate shared memory and that compare a 
pointer to shared memory with a pointer to private memory may run 
incorrectly if this optimization is enabled. 

+O[no]size 
Optimization level: +O2, +O3, +O4 
Default: +Onosize 

The +Osize option suppresses optimizations that significantly increase 
code size. Specifying +Osize implies specifying +Oinline_budget=1. 
See the section "+Oinline_budget=n" on page 125 for additional 
information. 

The +Onosize option does not prevent optimizations that can increase 
code size. 

+O[no]static_prediction 
Optimization level: +O0, +O1, +O2, +O3, +O4 
Default: +Onostatic_prediction 

+Ostatic_prediction turns on static branch prediction for 
PA-RISC 2.0 targets. Use +Ostatic_prediction to better optimize 
large programs with poor instruction locality, such as operating system 
and database code. 

PA-RISC 2.0 predicts the direction conditional branches go in one of two 
ways: 

• Dynamic branch prediction uses a hardware history mechanism to 
predict future executions of a branch from its last three executions. It 
is transparent and quite effective, unless the hardware buffers 
involved are overwhelmed by a large program with poor locality. 

• Static branch prediction, when enabled, predicts each branch based 
on implicit hints encoded in the branch instruction itself. The static 
branch prediction is responsible for handling large codes with poor 
locality for which the small dynamic hardware facility proves 
inadequate. 

+O[no]vectorize 
Optimization level: +O3, +O4 
Default: +Onovectorize 

+Ovectorize allows the compiler to replace certain loops with calls to 
vector routines. Use +Ovectorize to increase the execution speed of 
loops. 

This option is not available in the HP aC++ compiler. 

When +Onovectorize is specified, loops are not replaced with calls to 
vector routines. 

Because the +Ovectorize option may change the order of floating-point 
operations in an application, it may also change the results of those 
operations slightly. See the HP-UX Floating-Point Guide for more 
information. 

The math library contains special prefetching versions of vector routines. 
If you have a PA2.0 application containing operations on large arrays 
(larger than 1 Megabyte in size), using +Ovectorize in conjunction 
with +Odataprefetch may improve performance. 

+Ovectorize is also included as part of the +Oaggressive and +Oall 
options. 

+O[no]volatile 
Optimization level: +O1, +O2, +O3, +O4 
Default: +Onovolatile 

This option is not available in the HP Fortran compiler. 

The C and C++ option +Ovolatile implies that memory references to 
global variables cannot be removed during optimization. 

The +Onovolatile option indicates that no globals are of volatile 
class. This means that references to global variables may be removed 
during optimization. 

Use this option to control the volatile semantics for all global variables. 

+O[no]whole_program_mode 

Optimization level: +O4 

Default: +Onowhole_program_mode 

Use +Owhole_program_mode to increase performance speed. This 
should be used only when you are certain that only the files compiled 
with +Owhole_program_mode directly access any globals that are 
defined in these files. 

This option is not available in the HP Fortran or aC++ compilers. 

+Owhole_program_mode enables the assertion that only the files that 
are compiled with this option directly reference any global variables and 
procedures that are defined in these files. In other words, this option 
asserts that there are no unseen accesses to the globals. 

When this assertion is in effect, the optimizer can hold global variables 
in registers longer and delete inlined or cloned global procedures. 

All files compiled with +Owhole_program_mode must also be compiled 
with +O4. If any of the files were compiled with +O4, but were not 
compiled with +Owhole_program_mode, the linker disables the 
assertion for all files in the program. 

The default, +Onowhole_program_mode, disables the assertion noted 
above. 

+tm target 

Optimization level: +O0, +O1, +O2, +O3, +O4 

Default target value: corresponds to the machine on which you invoke 
the compiler. 

This option specifies the target machine architecture for which 
compilation is to be performed. Using this option causes the compiler to 
perform architecture-specific optimizations. 

target takes one of the following values: 

• K8000 to specify K-Class servers using PA-8000 processors 

• V2000 to specify V2000 servers 

• V2200 to specify V2200 servers 

• V2250 to specify V2250 servers 

This option is valid at all optimization levels. The default target value 
corresponds to the machine on which you invoke the compiler. 

Using the +tm target option implies +DA and +DS settings as described in 
Table 18. +DAarchitecture causes the compiler to generate code for the 
architecture specified by architecture. +DSmodel causes the compiler to 
use the instruction scheduler tuned to model. See the f90(1) man page, 
aCC(1) man page, or the cc(1) man page for more information describing the 
+DA and +DS options. 

Table 18    +tm target and +DA/+DS 

target value specified    +DAarchitecture implied    +DSmodel implied 

K8000                     2.0                        2.0 
V2000                     2.0                        2.0 
V2200                     2.0                        2.0 
V2250                     2.0                        2.0 

If you specify +DA or +DS on the compiler command line, your setting 
takes precedence over the setting implied by +tm target. 

C aliasing options 

The optimizer makes a conservative assumption that a pointer can point 
to any object in the entire application. Command-line options to the C 
compiler are available to inform the optimizer of an application's pointer 
usage. Using this information, the optimizer can generate more efficient 
code, due to the elimination of some false assumptions. 

You can describe pointer behavior to the optimizer by using the following 
options: 

• +O[no]ptrs_strongly_typed 

• +O[no]ptrs_to_globals[=namelist] 

• +O[no]global_ptrs_unique[=namelist] 

• +O[no]ptrs_ansi 

where 

namelist is a comma-separated list of global variable names. 

The following are type-inferred aliasing rules that apply when using 
these +O optimization options: 

• Type-aliasing optimizations are based on the assumption that pointer 
dereferences obey their declared types. 

• A C variable is considered address-exposed if and only if the address 
of that variable is assigned to another variable or passed to a function 
as an actual parameter. In general, address-exposed objects are 
collected into a separate group, based on their declared types. Global 
and static variables are considered address-exposed by default. Local 
variables and actual parameters are considered address-exposed only 
if their addresses have been computed using the address operator &. 

• Dereferences of pointers to a certain type are assumed to only alias 
with the corresponding equivalent group. An equivalent group 
includes all the address-exposed objects of the same type. The 
dereferences of pointers are also assumed to alias with other pointer 
dereferences associated with the same group. 

For example, in the following line: 

int *p, *q; 

*p and *q are assumed to alias with any objects of type int. Also, *p 
and *q are assumed to alias with each other. 

• Signed/unsigned type distinctions are ignored in grouping objects into 
an equivalent group. Likewise, long and int types are considered to 
map to the same equivalent group. However, the volatile type 
qualifier is considered significant in grouping objects into equivalent 
groups. For example, a pointer to int is not considered to alias with a 
volatile int object. 

• If two type names reduce to the same type, they are considered 
synonymous. 

In the following example, both types type_old and type_new reduce 
to the same type, struct foo_st. 

typedef struct foo_st type_old; 

typedef type_old type_new; 

• Each field of a structure type is placed in a separate equivalent group 
that is distinct from the equivalent group of the field's base type. The 
assumption here is that a pointer to int is not assigned the address 
of a structure field whose type is int. The actual type name of a 
structure type is not considered significant in constructing equivalent 
groups. For example, dereferences of a struct foo pointer and a 
struct bar pointer are assumed to alias with each other when 
struct foo and struct bar have identical field declarations. 

• All fields of a union type are placed in the same equivalent group, 
which is distinct from the equivalent group of any of the field's base 
types. This means that all dereferences of pointers to a particular 
union type are assumed to alias with each other, regardless of which 
union field is being accessed. 

• Address-exposed array variables are grouped into the equivalent 
group of the array element type. 

• Applying an explicit pointer typecast to an expression value causes 
any later use of the typecast expression value to be associated with 
the equivalent group of the typecast expression value. 

For example, an int pointer typecast into a float pointer and then 
dereferenced is assumed to potentially access objects in the float 
equivalent group, and not the int equivalent group. 

However, type-incompatible assignments to pointer variables do not 
alter the aliasing assumptions on subsequent references of such 
pointer variables. 

In general, type-incompatible assignments can potentially invalidate 
some of the type-safe assumptions. Such constructs may elicit 
compiler warning messages. 

Optimization directives and pragmas 

This section lists the directives and pragmas available for use in 
optimization. Table 19 below describes the options and the optimization 
levels at which they are used. The pragmas are not supported by the 
aC++ compiler. 

The loop_parallel, parallel, prefer_parallel, and 
end_parallel options are described in "Parallel programming 
techniques," on page 175. 

Table 19    Directive-based optimization options 

Directives and Pragmas                 Valid Optimization levels 

block_loop[(block_factor=n)]           +O3, +O4 
dynsel[(trip_count=n)]                 +O3, +O4 
no_block_loop                          +O3, +O4 
no_distribute                          +O3, +O4 
no_dynsel                              +O3, +O4 
no_loop_dependence(namelist)           +O3, +O4 
no_loop_transform                      +O3, +O4 
no_parallel                            +O3, +O4 
no_side_effects                        +O3, +O4 
no_unroll_and_jam                      +O3, +O4 
reduction(namelist)                    +O3, +O4 
scalar                                 +O3, +O4 
sync_routine(routinelist)              +O3, +O4 
unroll_and_jam[(unroll_factor=n)]      +O3, +O4 

Rules for usage 

The form of the optimization directives and pragmas is shown in 
Table 20. 

NOTE    The HP aC++ compiler does not support the optimization pragmas 
        described in this section. 

Table 20    Form of optimization directives and pragmas 

Language    Form 

Fortran     C$DIR directive-list 
C           #pragma _CNX directive-list 

where 

directive-list 

is a comma-separated list of one or more of the 
directives/pragmas described in this chapter. 

• Directive names are presented here in lowercase, and they may be 
specified in either case in both languages. However, #pragma must 
always appear in lowercase in C. 

• In the sections that follow, namelist represents a comma-separated 
list of names. These names can be variables, arrays, or common 
blocks. In the case of a common block, its name must be enclosed 
within slashes. The occurrence of a lowercase n or m is used to 
indicate an integer constant. 

• Occurrences of gate_var are for variables that have been or are being 
defined as gates. Any parameters that appear within square brackets 
([ ]) are optional. 

block_loop[(block_factor=n)] 

block_loop[(block_factor=n)] indicates a specific loop to block 
and, optionally, the block factor n. The block factor is used in the 
compiler's internal computation of loop nest-based data reuse; it 
represents the number of times that data is reused as a result of loop 
nesting. This figure must be an integer constant greater than or equal 
to 2. If no block_factor is specified, the compiler uses a heuristic to 
determine the block_factor. For more information on loop blocking, 
refer to the "Optimization levels" section on page 25. 
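
For readers unfamiliar with blocking, the sketch below blocks a matrix transpose by hand. It is only an approximation of the effect; the dimension N, the constant BLOCK, and the function names are hypothetical, and the compiler's own blocked code and its choice of block factor may differ.

```c
#include <assert.h>
#include <stddef.h>

#define N 8      /* hypothetical matrix dimension */
#define BLOCK 4  /* hypothetical block size, for illustration only */

/* Unblocked transpose: one of the two arrays is always accessed
   with a stride of N elements. */
void transpose_simple(double dst[N][N], double src[N][N])
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            dst[j][i] = src[i][j];
}

/* Blocked transpose: the iteration space is split into BLOCK-by-BLOCK
   tiles so that the data touched by one tile can stay in cache. */
void transpose_blocked(double dst[N][N], double src[N][N])
{
    assert(dst != NULL && src != NULL);
    for (size_t ii = 0; ii < N; ii += BLOCK)
        for (size_t jj = 0; jj < N; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK && i < N; i++)
                for (size_t j = jj; j < jj + BLOCK && j < N; j++)
                    dst[j][i] = src[i][j];
}
```

Both functions write every element of dst exactly once, so blocking changes only the order of the accesses, not the result.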

dynsel[(trip_count=n)] 

dynsel[(trip_count=n)] enables workload-based dynamic selection for 
the immediately following loop. trip_count represents the 
thread_trip_count attribute, and n is an integer constant. 

• When thread_trip_count = n is specified, the serial version of the 
loop is run if the iteration count is less than n. Otherwise, the 
thread-parallel version is run. 

• For more information on dynamic selection, refer to the description of 
the optimization option "+O[no]dynsel" on page 120. 
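
The selection rule in the first bullet can be sketched in plain C as follows. This is a conceptual model only: the names are hypothetical, and the "parallel" routine is a serial stand-in, because the real thread-parallel version is generated by the compiler rather than written by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Records which version of the loop was selected at run time. */
enum loop_version { SERIAL_VERSION, PARALLEL_VERSION };

/* Serial version of the loop. */
static void axpy_serial(double *y, const double *x, double k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += k * x[i];
}

/* Stand-in for the thread-parallel version. In this sketch it runs the
   same loop serially; it exists only to mark the other branch. */
static void axpy_parallel_stub(double *y, const double *x, double k, size_t n)
{
    axpy_serial(y, x, k, n);
}

/* Runs the serial version when the iteration count is less than
   trip_count, and the "parallel" version otherwise. */
enum loop_version axpy_dynsel(double *y, const double *x, double k,
                              size_t n, size_t trip_count)
{
    assert(x != NULL && y != NULL);
    if (n < trip_count) {
        axpy_serial(y, x, k, n);
        return SERIAL_VERSION;
    }
    axpy_parallel_stub(y, x, k, n);
    return PARALLEL_VERSION;
}
```

The point of the threshold is that short loops do not repay the cost of starting threads, so the serial copy is kept for them.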

no_block_loop 

no_block_loop disables loop blocking on the immediately following 
loop. For more information on loop blocking, see the description of 
block_loop[(block_factor=n)] in this section, or refer to the 
description of the optimization option "+O[no]loop_block" on 
page 127. 

no_distribute 

no_distribute disables loop distribution for the immediately following 
loop. For more information on loop distribution, refer to the description of 
the optimization option "+O[no]loop_transform" on page 127. 

no_dynsel 

no_dynsel disables workload-based dynamic selection for the 
immediately following loop. For more information on dynamic selection, 
refer to the description of the optimization option "+O[no]dynsel" on 
page 120. 

no_loop_dependence(namelist) 

no_loop_dependence(namelist) informs the compiler that the arrays 
in namelist do not have any dependences for iterations of the 
immediately following loop. Use no_loop_dependence for arrays only. 
Use loop_private to indicate dependence-free scalar variables. 

This directive or pragma causes the compiler to ignore any dependences 
that it perceives to exist. This can enhance the compiler's ability to 
optimize the loop, including parallelization. 

For more information on loop dependence, refer to the "Loop-carried 
dependences" section on page 292. 
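
A typical case is a loop that updates an array through an index array, as in the sketch below. The function scatter_add is a hypothetical example. The compiler cannot tell whether idx contains duplicate values, so it must assume a dependence between iterations; the pragma asserts that none exists. The __hpux guard is an assumption made so that other compilers do not see the pragma, and the assertion is valid only if the caller guarantees that idx holds no duplicates.

```c
#include <assert.h>
#include <stddef.h>

/* Adds b[i] into a[idx[i]] for each i. The caller must guarantee that
   idx contains no duplicate values; otherwise the no_loop_dependence
   assertion below would be false. */
void scatter_add(double *a, const int *idx, const double *b, int n)
{
    assert(a != NULL && idx != NULL && b != NULL);
#ifdef __hpux  /* assumed guard: expose the pragma only to HP compilers */
#pragma _CNX no_loop_dependence(a)
#endif
    for (int i = 0; i < n; i++)
        a[idx[i]] += b[i];
}
```

Because the compiler takes the assertion on trust, a false claim of independence can produce wrong answers once the loop is reordered or parallelized.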

no_loop_transform 

no_loop_transform prevents the compiler from performing reordering 
transformations on the following loop. The compiler does not distribute, 
fuse, block, interchange, unroll, or unroll and jam a loop on which this 
directive appears. For more information on no_loop_transform, refer 
to the optimization option "+O[no]loop_transform" on page 127. 

no_parallel 

no_parallel prevents the compiler from generating parallel code for 
the immediately following loop. For more information on no_parallel, 
refer to the optimization option "+O[no]parallel" on page 129. 

no_side_effects(funclist) 

no_side_effects(funclist) informs the compiler that the functions 
appearing in funclist have no side effects wherever they appear lexically 
following the directive. Side effects include modifying a function 
argument, modifying a Fortran common variable, performing I/O, or 
calling another routine that does any of the above. The compiler can 
sometimes eliminate calls to procedures that have no side effects. The 
compiler may also be able to parallelize loops with calls when informed 
that the called routines do not have side effects. 

unroll_and_jam[(unroll_factor=n)] 

unroll_and_jam[(unroll_factor=n)] causes one or more 
noninnermost loops in the immediately following nest to be partially 
unrolled (to a depth of n if unroll_factor is specified), then fuses the 
resulting loops back together. It must be placed on a loop that ends up 
being noninnermost after any compiler-initiated interchanges. For more 
information on unroll_and_jam, refer to the description of 
"+O[no]loop_unroll_jam" on page 128. 

Optimization Report 


The Optimization Report is produced by the HP Fortran, HP aC++, and 
HP C compilers. It is most useful at optimization levels +O3 and +O4. 
This chapter includes a discussion of the following topics: 

• Optimization Report contents 

• Loop Report 

Optimization Report contents 

When you compile a program with the +Oreport[=report_type] 
optimization option at the +O3 and +O4 levels, the compiler generates an 
Optimization Report for each program unit. The 
+Oreport[=report_type] option determines the report's contents based 
on the value of report_type, as shown in Table 21. 

Table 21    Optimization Report contents 

report_type values                 Report contents 

all                                Loop Report and Privatization Table 
loop                               Loop Report 
private                            Loop Report and Privatization Table 
report_type not given (default)    Loop Report 

The +Onoreport option does not accept any of the report_type values. 
Sample Optimization Reports are provided throughout this chapter. 

Loop Report 

The Loop Report lists the optimizations that are performed on loops and 
calls. If appropriate, the report gives reasons why a possible 
optimization was not performed. Loop nests are reported in the order in 
which they are encountered and separated by a blank line. 

Below is a sample optimization report. 

                        Optimization Report 

Line   Id     Var        Reordering       New        Optimizing / Special 
Num.   Num.   Name       Transformation   Id Nums    Transformation 

   3   1      sub1       *Inlined call    (2-4) 
   8   2      iloopi:1   Serial                      Fused 
  11   3      jloopi:2   Serial                      Fused 
  14   4      kloopi:3   Serial                      Fused 
                         *Fused           (5)        (2 3 4) -> (5) 
   8   5      iloopi:1   PARALLEL 

Footnoted     User 
Var Name      Var Name 

iloopi:1      iloopindex 
jloopi:2      jloopindex 
kloopi:3      kloopindex 

                        Optimization for sub1 

Line   Id     Var        Reordering       New        Optimizing / Special 
Num.   Num.   Name       Transformation   Id Nums    Transformation 

   8   1      iloopi:1   Serial                      Fused 
  11   2      jloopi:2   Serial                      Fused 
  14   3      kloopi:3   Serial                      Fused 
                         *Fused           (4)        (1 2 3) -> (4) 
   8   4      iloopi:1   PARALLEL 

Footnoted     User 
Var Name      Var Name 

iloopi:1      iloopindex 
jloopi:2      jloopindex 
kloopi:3      kloopindex 

A description of each column of the Loop Report is shown in Table 22. 

Table 22    Loop Report column definitions 

Column          Description 

Line Num. 

Specifies the source line of the beginning of the loop or of the loop 
from which it was derived. For cloned calls and inlined calls, the 
Line Num. column specifies the source line at which the call 
statement appears. 

Id Num. 

Specifies a unique ID number for every optimized loop and for every 
optimized call. This ID number can then be referenced by other parts 
of the report. Both loops appearing in the original program source 
and loops created by the compiler are given loop ID numbers. Loops 
created by the compiler are also shown in the New Id Nums column 
as described later. No distinction between compiler-generated loops 
and loops that existed in the original source is made in the Id Num. 
column. Loops are assigned unique, sequential numbers as they are 
encountered. 

Var Name 

Specifies the name of the iteration variable controlling the loop or the 
called procedure if the line represents a call. If the variable is 
compiler-generated, its name is listed as *VAR*. If it consists of a 
truncated variable name followed by a colon and a number, the 
number is a reference to the variable name footnote table, which 
appears after the Loop Report and Analysis Table in the 
Optimization Report. 

Reordering Transformation 

Indicates which reordering transformations were performed. 
Reordering transformations are performed on loops, calls, and loop 
nests, and typically involve reordering and/or duplicating sections of 
code to facilitate more efficient execution. This column has one of the 
values shown in Table 23 on page 155. 

New Id Nums 

Specifies the ID number for loops or calls created by the compiler. 
These ID numbers are listed in the Id Num. column and are referenced 
in other parts of the report. However, the loops and calls they 
represent were not present in the original source code. In the case of 
loop fusion, the number in this column indicates the new loop created 
by merging all the fused loops. New ID numbers are also created for 
cloned calls, inlined calls, loop blocking, loop distribution, loop 
interchange, loop unroll and jam, dynamic selection, and test 
promotion. 
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Column 

Description 

Optimizing / 
Special 

Transformation 

1 ndicates which, if any, optimizing transformations were performed. 

An optimizing transformation reduces the number of operations 
executed, or replaces operations with simpler operations. A special 
transformation allows the compiler to optimize code under special 
circumstances. When appropriate, this column has one of the values 
shown in Table 24 on page 157. 


The following values apply to the Reordering Transformation column
described in Table 22 on page 154.


Table 23 Reordering transformation values in the Loop Report

Value / Description

Block
    Loop blocking was performed. The new loop order is indicated under
    the Optimizing/Special Transformation column, as shown in Table 24.

Cloned call
    A call to a subroutine was cloned.

Dist
    Loop distribution was performed.

DynSel
    Dynamic selection was performed. The numbers in the New Id Nums
    column correspond to the loops created. For parallel loops, these
    generally include a parallel and a serial version.

Fused
    The loops were fused into another loop and no longer exist. The
    original loops and the new loop are indicated under the
    Optimizing/Special Transformation column, as shown in Table 24.

Inlined call
    A call to a subroutine was inlined.

Interchange
    Loop interchange was performed. The new loop order is indicated
    under the Optimizing/Special Transformation column, as shown in
    Table 24.

None
    No reordering transformation was performed on the call.

PARALLEL
    The loop runs in thread-parallel mode.

Peel
    The first or last iteration of the loop was peeled in order to fuse
    the loop with an adjacent loop.

Promote
    Test promotion was performed.

Serial
    No reordering transformation was performed on the loop.

Unroll and Jam
    The loop was unrolled and the nested loops were jammed (fused).

VECTOR
    The loop was fully or partially replaced with more efficient calls
    to one or more vector routines.

*
    Appears at the left of loop-producing transformations
    (distribution, dynamic selection, blocking, fusion, interchange,
    call cloning, call inlining, peeling, promotion, unroll and jam).
The following values apply to the Optimizing/Special Transformation
column described in Table 22 on page 154.


Table 24 Optimizing/special transformations values in the Loop Report

Value / Explanation

Fused
    The loop was fused into another loop and no longer exists.

Reduction
    The compiler recognized a reduction in the loop.

Removed
    The compiler removed the loop.

Unrolled
    The loop was completely unrolled.

(OrigOrder) -> (InterchangedOrder)
    This information appears when Interchange is reported under
    Reordering Transformation. OrigOrder indicates the order of loops
    in the original nest. InterchangedOrder indicates the new order
    that occurs due to interchange. OrigOrder and InterchangedOrder
    consist of user iteration variables presented in outermost to
    innermost order.

(OrigLoops) -> (NewLoop)
    This information appears when Fused is reported under Reordering
    Transformation. OrigLoops indicates the original loops that were
    fused by the compiler to form the loop indicated by NewLoop.
    OrigLoops and NewLoop refer to loops based on the values from the
    Id Num. and New Id Nums columns in the Loop Report.

(OrigLoopNest) -> (BlockedLoopNest)
    This information appears when Block is reported under Reordering
    Transformation. OrigLoopNest indicates the order of the original
    loop nest containing a loop that was blocked. BlockedLoopNest
    indicates the order of loops after blocking. OrigLoopNest and
    BlockedLoopNest refer to user iteration variables presented in
    outermost to innermost order.
Supplemental tables

The tables described in this section may be included in the
Optimization Report to provide information supplemental to the
Loop Report.

Analysis Table

If necessary, an Analysis Table is included in the Optimization Report to
further elaborate on optimizations reported in the Loop Report.

A description of each column in the Analysis Table is shown in Table 25.


Table 25 Analysis Table column definitions

Column / Description

Line Num.
    Specifies the source line of the beginning of the loop or call.

Id Num.
    References the ID number assigned to the loop or call in the Loop
    Report.

Var Name
    Specifies the name of the iteration variable controlling the loop,
    or *var* (as discussed in the Var Name description in the section
    "Loop Report" on page 153).

Analysis
    Indicates why a transformation or optimization was not performed,
    or gives additional information on what was done.
Privatization Table

This table reports any user variables contained in a parallelized loop
that are privatized by the compiler. Because the Privatization Table
refers to loops, the Loop Report is automatically provided with it.

A description of each column in the Privatization Table is shown in
Table 26.


Table 26 Privatization Table column definitions

Column / Definition

Line Num.
    Specifies the source line of the beginning of the loop.

Id Num.
    References the ID number assigned to the loop in the loop table.

Var Name
    Specifies the name of the iteration variable controlling the loop.
    *var* may also appear in this column, as discussed in the Var Name
    description in the section "Loop Report" on page 153.

Priv Var
    Specifies the name of the privatized user variable.
    Compiler-generated variables that are privatized are not reported
    here.

Privatization Information for Parallel Loops
    Provides more detail on the variable privatizations performed.

Variable Name Footnote Table

Variable names that are too long to fit in the Var Name columns of the
other tables are truncated and followed by a colon and a footnote
number. These footnotes are explained in the Variable Name Footnote
Table.

A description of each column in the Variable Name Footnote Table is
shown in Table 27.


Table 27 Variable Name Footnote Table column definitions

Column / Definition

Footnoted Var Name
    Specifies the truncated variable name and its footnote number.

User Var Name
    Specifies the full name of the variable as identified in the source
    code.


Optimization Report

The following Fortran program is the basis for the Optimization Report
shown in this example. Line numbers are provided for ease of reference.

     1       PROGRAM EXAMPLE99
     2       REAL A(100), B(100), C(100)
     3       CALL SUB1(A,B,C)
     4       END
     5
     6       SUBROUTINE SUB1(A,B,C)
     7       REAL A(100), B(100), C(100)
     8       DO ILOOPINDEX=1,100
     9         A(ILOOPINDEX) = ILOOPINDEX
    10       ENDDO
    11       DO JLOOPINDEX=1,100
    12         B(JLOOPINDEX) = A(JLOOPINDEX)**2
    13       ENDDO
    14       DO KLOOPINDEX=1,100
    15         C(KLOOPINDEX) = A(KLOOPINDEX) + B(KLOOPINDEX)
    16       ENDDO
    17       PRINT *, A(1), B(50), C(100)
    18       END

The following Optimization Report is generated by compiling the
program example99 with the command-line options +O3 +Oparallel
+Oreport=all +Oinline=sub1:

    % f90 +O3 +Oparallel +Oreport=all +Oinline=sub1 EXAMPLE99.f
    Optimization for EXAMPLE99

    Line  Id    Var       Reordering      New      Optimizing / Special
    Num.  Num.  Name      Transformation  Id Nums  Transformation
    3     1     sub1      *Inlined call   (2-4)
    8     2     iloopi:1  Serial                   Fused
    11    3     jloopi:2  Serial                   Fused
    14    4     kloopi:3  Serial                   Fused
                          *Fused          (5)      (2 3 4) -> (5)
    8     5     iloopi:1  PARALLEL

    Footnoted   User
    Var Name    Var Name
    iloopi:1    iloopindex
    jloopi:2    jloopindex
    kloopi:3    kloopindex

    Optimization for sub1

    Line  Id    Var       Reordering      New      Optimizing / Special
    Num.  Num.  Name      Transformation  Id Nums  Transformation
    8     1     iloopi:1  Serial                   Fused
    11    2     jloopi:2  Serial                   Fused
    14    3     kloopi:3  Serial                   Fused
                          *Fused          (4)      (1 2 3) -> (4)
    8     4     iloopi:1  PARALLEL

    Footnoted   User
    Var Name    Var Name
    iloopi:1    iloopindex
    jloopi:2    jloopindex
    kloopi:3    kloopindex



The Optimization Report for example99 provides the following
information:

• Call to sub1 is inlined

The first line of the Loop Report shows that the call to sub1 was
inlined, as shown below:

    3     1     sub1      *Inlined call   (2-4)

• Three new loops produced

The inlining produced three new loops in example99: Loop #2,
Loop #3, and Loop #4. Internally, the example99 module that
originally looked like:

    1       PROGRAM EXAMPLE99
    2       REAL A(100), B(100), C(100)
    3       CALL SUB1(A,B,C)
    4       END

now looks like this:

    PROGRAM EXAMPLE99
    REAL A(100), B(100), C(100)
    DO ILOOPINDEX=1,100                              !Loop #2
      A(ILOOPINDEX) = ILOOPINDEX
    ENDDO
    DO JLOOPINDEX=1,100                              !Loop #3
      B(JLOOPINDEX) = A(JLOOPINDEX)**2
    ENDDO
    DO KLOOPINDEX=1,100                              !Loop #4
      C(KLOOPINDEX) = A(KLOOPINDEX) + B(KLOOPINDEX)
    ENDDO
    PRINT *, A(1), B(50), C(100)
    END


• New loops are fused

The following lines indicate that the new loops have been fused. The
last line shows that the three loops were fused into one new loop,
Loop #5.

    8     2     iloopi:1  Serial                   Fused
    11    3     jloopi:2  Serial                   Fused
    14    4     kloopi:3  Serial                   Fused
                          *Fused          (5)      (2 3 4) -> (5)

After fusing, the code internally appears as the following:

    PROGRAM EXAMPLE99
    REAL A(100), B(100), C(100)
    DO ILOOPINDEX=1,100                              !Loop #5
      A(ILOOPINDEX) = ILOOPINDEX
      B(ILOOPINDEX) = A(ILOOPINDEX)**2
      C(ILOOPINDEX) = A(ILOOPINDEX) + B(ILOOPINDEX)
    ENDDO
    PRINT *, A(1), B(50), C(100)
    END
• New loop is parallelized

In the following Loop Report line:

    8     5     iloopi:1  PARALLEL

Loop #5 uses iloopi:1 as the iteration variable, referencing the
Variable Name Footnote Table; iloopi:1 corresponds to iloopindex.
The same line in the report also indicates that the newly created
Loop #5 was parallelized.

• Variable Name Footnote Table lists iteration variables

According to the Variable Name Footnote Table (duplicated below),
the original variable iloopindex is abbreviated by the compiler as
iloopi:1 so that it fits into the Var Name columns of other reports.
jloopindex and kloopindex are abbreviated as jloopi:2 and
kloopi:3, respectively. These names are used throughout the report
to refer to these iteration variables.

    Footnoted   User
    Var Name    Var Name
    iloopi:1    iloopindex
    jloopi:2    jloopindex
    kloopi:3    kloopindex
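
The fusion reported for example99 can be sketched in C, one of the other languages this guide covers. This is only an illustration of the transformation, not compiler output: the function names sub1_unfused and sub1_fused are hypothetical, and C arrays are indexed from 0 where the Fortran arrays start at 1.

```c
#include <stddef.h>

#define N 100

/* Unfused form: three separate loops over the same index range,
   mirroring the three DO loops of SUB1 (Loops #2, #3 and #4). */
void sub1_unfused(float a[N], float b[N], float c[N])
{
    for (size_t i = 0; i < N; i++)
        a[i] = (float)(i + 1);
    for (size_t j = 0; j < N; j++)
        b[j] = a[j] * a[j];
    for (size_t k = 0; k < N; k++)
        c[k] = a[k] + b[k];
}

/* Fused form: a single loop that performs all three assignments per
   iteration, corresponding to Loop #5 in the report. Each statement
   reads only values written earlier in the same iteration, so the
   results match the unfused form. */
void sub1_fused(float a[N], float b[N], float c[N])
{
    for (size_t i = 0; i < N; i++) {
        a[i] = (float)(i + 1);
        b[i] = a[i] * a[i];
        c[i] = a[i] + b[i];
    }
}
```

Calling both functions on separate arrays and comparing the results element by element is a simple way to confirm that the fused loop is equivalent. The fused form makes one pass over the index range instead of three, which is what leaves a single candidate loop for parallelization.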


Optimization Report

The following Fortran code provides an example of other transformations
the compiler performs. Line numbers are provided for ease of reference.

     1       PROGRAM EXAMPLE100
     2
     3       INTEGER IA1(100), IA2(100), IA3(100)
     4       INTEGER I1, I2
     5
     6       DO I = 1, 100
     7         IA1(I) = I
     8         IA2(I) = I * 2
     9         IA3(I) = I * 3
    10       ENDDO
    11
    12       I1 = 0
    13       I2 = 100
    14       CALL SUB1(IA1, IA2, IA3, I1, I2)
    15       END
    16
    17       SUBROUTINE SUB1(A, B, C, S, N)
    18       INTEGER A(N), B(N), C(N), S, I, J
    19       DO J = 1, N
    20         DO I = 1, N
    21           IF (I .EQ. 1) THEN
    22             S = S + A(I)
    23           ELSE IF (I .EQ. N) THEN
    24             S = S + B(I)
    25           ELSE
    26             S = S + C(I)
    27           ENDIF
    28         ENDDO
    29       ENDDO
    30       END

The following Optimization Report is generated by compiling the
program example100 for parallelization:

    % f90 +O3 +Oparallel +Oreport=all example100.f

    Optimization for SUB1

    Line  Id    Var   Reordering      New      Optimizing / Special
    Num.  Num.  Name  Transformation  Id Nums  Transformation
    19    1     j     *Interchange    (2)      (j i) -> (i j)
    20    2     i     *DynSel         (3-4)
    20    3     i     PARALLEL                 Reduction
    19    5     j     *Promote        (6-7)
    19    6     j     Serial
    19    7     j     Serial
    20    4     i     Serial
    19    8     j     *Promote        (9-10)
    19    9     j     Serial
    19    10    j     *Promote        (11-12)
    19    11    j     Serial
    19    12    j     Serial

    Line  Id    Var   Analysis
    Num.  Num.  Name
    19    5     j     Test on line 21 promoted out of loop
    19    8     j     Test on line 21 promoted out of loop
    19    10    j     Test on line 23 promoted out of loop

    Optimization for clone 1 of SUB1 (6_e70_cl_sub1)

    Line  Id    Var   Reordering      New      Optimizing / Special
    Num.  Num.  Name  Transformation  Id Nums  Transformation
    19    1     j     *Interchange    (2)      (j i) -> (i j)
    20    2     i     PARALLEL                 Reduction
    19    3     j     *Promote        (4-5)
    19    4     j     Serial
    19    5     j     *Promote        (6-7)
    19    6     j     Serial
    19    7     j     Serial

    Line  Id    Var   Analysis
    Num.  Num.  Name
    19    3     j     Test on line 21 promoted out of loop
    19    5     j     Test on line 23 promoted out of loop

    Optimization for example100

    Line  Id    Var   Reordering      New      Optimizing / Special
    Num.  Num.  Name  Transformation  Id Nums  Transformation
    6     1     i     Serial
    14    2     sub1  *Cloned call    (3)
    14    3     sub1  None

    Line  Id    Var   Analysis
    Num.  Num.  Name
    14    2     sub1  Call target changed to clone 1 of SUB1
                      (6_e70_cl_sub1)

The Optimization Report for EXAMPLE100 shows Optimization Reports
for the subroutine and its clone, followed by the optimizations to the
subroutine. It includes the following information:

• Original subroutine contents

Originally, the subroutine appeared as shown below:

    17       SUBROUTINE SUB1(A, B, C, S, N)
    18       INTEGER A(N), B(N), C(N), S, I, J
    19       DO J = 1, N
    20         DO I = 1, N
    21           IF (I .EQ. 1) THEN
    22             S = S + A(I)
    23           ELSE IF (I .EQ. N) THEN
    24             S = S + B(I)
    25           ELSE
    26             S = S + C(I)
    27           ENDIF
    28         ENDDO
    29       ENDDO
    30       END

• Loop interchange performed first

The compiler first performs loop interchange (listed as Interchange
in the report) to maximize cache performance:

    19    1     j     *Interchange    (2)      (j i) -> (i j)
• The subroutine then becomes the following

    17       SUBROUTINE SUB1(A, B, C, S, N)
    18       INTEGER A(N), B(N), C(N), S, I, J
    19       DO I = 1, N                               ! Loop #2
    20         DO J = 1, N                             ! Loop #1
    21           IF (I .EQ. 1) THEN
    22             S = S + A(I)
    23           ELSE IF (I .EQ. N) THEN
    24             S = S + B(I)
    25           ELSE
    26             S = S + C(I)
    27           ENDIF
    28         ENDDO
    29       ENDDO
    30       END

• The program is optimized for parallelization

The compiler would like to parallelize the outermost loop in the nest,
which is now the i loop. However, because the value of N is not known,
the compiler does not know how many times the i loop needs to be
executed. To ensure that the loop is executed as efficiently as possible
at runtime, the compiler replaces the i loop nest with two new copies
of the i loop nest, one to be run in parallel, the other to be run
serially.

• Dynamic selection is executed

An IF is then inserted to select the more efficient version of the loop
to execute at runtime. This method of making one copy for parallel
execution and one copy for serial execution is known as
dynamic selection, which is enabled by default when
+O3 +Oparallel is specified (see "Dynamic selection" on page 102 for
more information). This optimization is reported in the Loop Report
in the line:

    20    2     i     *DynSel         (3-4)

• Loop #2 creates two loops

According to the report, Loop #2 was used to create the new loops,
Loop #3 and Loop #4. Internally, the code now is represented as
follows:

    SUBROUTINE SUB1(A, B, C, S, N)
    INTEGER A(N), B(N), C(N), S, I, J
    IF (N .GT. some_threshold) THEN
      DO (parallel) I = 1, N                         ! Loop #3
        DO J = 1, N                                  ! Loop #5
          IF (I .EQ. 1) THEN
            S = S + A(I)
          ELSE IF (I .EQ. N) THEN
            S = S + B(I)
          ELSE
            S = S + C(I)
          ENDIF
        ENDDO
      ENDDO
    ELSE
      DO I = 1, N                                    ! Loop #4
        DO J = 1, N                                  ! Loop #8
          IF (I .EQ. 1) THEN
            S = S + A(I)
          ELSE IF (I .EQ. N) THEN
            S = S + B(I)
          ELSE
            S = S + C(I)
          ENDIF
        ENDDO
      ENDDO
    ENDIF
    END
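
The shape of dynamic selection can be sketched in C. This is a hand-written illustration under stated assumptions, not what the compiler emits: SOME_THRESHOLD, NUM_CHUNKS and all function names are hypothetical, the real threshold is chosen by the compiler, and the "parallel" copy below only mimics per-thread partial sums without creating any threads.

```c
#include <stddef.h>

/* Hypothetical cutoff; the compiler picks its own threshold. */
#define SOME_THRESHOLD 64
/* Hypothetical stand-in for the number of threads. */
#define NUM_CHUNKS 4

typedef enum { RAN_SERIAL, RAN_PARALLEL } version_t;

/* Serial copy of the loop. */
static long sum_serial(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Stand-in for the parallel copy: each chunk accumulates a partial
   sum, as one thread would, and the partial sums are combined at the
   end. The chunks run one after another here. */
static long sum_chunked(const int *a, size_t n)
{
    long partial[NUM_CHUNKS] = {0};
    size_t chunk = (n + NUM_CHUNKS - 1) / NUM_CHUNKS;
    for (size_t t = 0; t < NUM_CHUNKS; t++) {
        size_t lo = t * chunk;
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        for (size_t i = lo; i < hi; i++)
            partial[t] += a[i];
    }
    long s = 0;
    for (size_t t = 0; t < NUM_CHUNKS; t++)
        s += partial[t];
    return s;
}

/* Dynamic selection: a runtime test on the trip count picks which
   copy of the loop runs. *which records the choice for inspection. */
long sum_dynsel(const int *a, size_t n, version_t *which)
{
    if (n > SOME_THRESHOLD) {
        *which = RAN_PARALLEL;
        return sum_chunked(a, n);
    }
    *which = RAN_SERIAL;
    return sum_serial(a, n);
}
```

The design point is that both copies compute the same result; only the cost differs. For short trip counts the overhead of starting threads would outweigh the gain, so the serial copy is selected.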

• Loop #3 contains reductions

Loop #3 (which was parallelized) also contained one or more
reductions. The Reordering Transformation column indicates
that the IF statements were promoted out of Loop #5, Loop #8, and
Loop #10.

• Analysis Table lists new loops

The line numbers of the promoted IF statements are listed. The first
test in Loop #5 was promoted, creating two new loops, Loop #6 and
Loop #7. Similarly, Loop #8 has a test promoted, creating Loop #9
and Loop #10. The test remaining in Loop #10 is then promoted,
thereby creating two additional loops. A promoted test is an IF
statement that is hoisted out of a loop. See the section "Test
promotion" on page 90 for more information. The Analysis Table
contents are shown below:

    19    5     j     Test on line 21 promoted out of loop
    19    8     j     Test on line 21 promoted out of loop
    19    10    j     Test on line 23 promoted out of loop
• DO loop is not reordered

The following DO loop does not undergo any reordering
transformation:

     6       DO I = 1, 100
     7         IA1(I) = I
     8         IA2(I) = I * 2
     9         IA3(I) = I * 3
    10       ENDDO

This fact is reported by the line

    6     1     i     Serial

• sub1 is cloned

The call to the subroutine sub1 is cloned. As indicated by the
asterisk (*), the compiler produced a new call. The new call is given
the ID (3) listed in the New Id Nums column. The new call is then
listed, with None indicating that no reordering transformation was
performed on the call to the new subroutine.

    14    2     sub1  *Cloned call    (3)
    14    3     sub1  None

• Cloned call is transformed

An Analysis Table line for the call is then appended to the Loop
Report to elaborate on the Cloned call transformation. This line
shows that the clone was called in place of the original subroutine.

    14    2     sub1  Call target changed to clone 1 of SUB1
                      (6_e70_cl_sub1)
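
The test promotion reported for SUB1 can be sketched in C. This is a hand-written illustration, not compiler output: the function names are hypothetical, the accumulator s is passed and returned by value rather than updated through an argument, and the arrays are indexed from 0, so a[i - 1] corresponds to A(I). After interchange the inner loop runs over j, and the tests depend only on i, which is why they can be hoisted out of the j loop.

```c
/* Interchanged nest with the tests still inside the inner j loop. */
int sub1_tests_inside(const int *a, const int *b, const int *c,
                      int s, int n)
{
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= n; j++) {
            if (i == 1)
                s += a[i - 1];
            else if (i == n)
                s += b[i - 1];
            else
                s += c[i - 1];
        }
    }
    return s;
}

/* The same nest after test promotion: each test is evaluated once per
   i iteration, and every branch gets its own test-free copy of the
   j loop. */
int sub1_tests_promoted(const int *a, const int *b, const int *c,
                        int s, int n)
{
    for (int i = 1; i <= n; i++) {
        if (i == 1) {
            for (int j = 1; j <= n; j++)
                s += a[i - 1];
        } else if (i == n) {
            for (int j = 1; j <= n; j++)
                s += b[i - 1];
        } else {
            for (int j = 1; j <= n; j++)
                s += c[i - 1];
        }
    }
    return s;
}
```

Promoting two tests yields three copies of the inner loop, which matches the way the report lists each *Promote line as creating two new loops and then promotes the remaining test out of one of them.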
Optimization Report

The following Fortran code shows loop blocking, loop peeling, loop
distribution, and loop unroll and jam. Line numbers are listed for ease of
reference.

     1       PROGRAM EXAMPLE200
     2
     3       REAL*8 A(1000,1000), B(1000,1000), C(100
     4       REAL*8 D(1000), E(1000)
     5       INTEGER M, N
     6
     7       N = 1000
     8       M = 1000
     9
    10       DO I = 1, N
    11         C(I) = 0
    12         DO J = 1, M
    13           A(I,J) = A(I,J) + B(I,J) * C(I)
    14         ENDDO
    15       ENDDO
    16
    17       DO I = 1, N-1
    18         D(I) = I
    19       ENDDO
    20
    21       DO J = 1, N
    22         E(J) = D(J) + 1
    23       ENDDO
    24
    25       PRINT *, A(103,103), B(517,517), D(11),
    26
    27       END

The following Optimization Report is generated by compiling program
example200 as follows:

    % f90 +O3 +Oreport +Oloop_block example200.f
    Optimization for example3

    Line  Id    Var   Reordering       New      Optimizing / Special
    Num.  Num.  Name  Transformation   Id Nums  Transformation
    10    1     i:1   *Dist            (2-3)
    10    2     i:1   Serial
    10    3     i:1   *Interchange     (4)      (i:1 j:1) -> (j:1 i:1)
    12    4     j:1   *Block           (5)      (j:1 i:1) -> (i:1 j:1 i:1)
    10    5     i:1   *Promote         (6-7)
    10    6     i:1   Serial                    Removed
    10    7     i:1   Serial
    12    8     j:1   *Unroll And Jam  (9)
    12    9     j:1   *Promote         (10-11)
    12    10    j:1   Serial                    Removed
    12    11    j:1   Serial
    10    12    i:1   Serial
    17    13    i:2   Serial                    Fused
    21    14    j:2   *Peel            (15)
    21    15    j:2   Serial                    Fused
                      *Fused           (16)     (13 15) -> (16)
    17    16    i:2   Serial

    Line  Id    Var   Analysis
    Num.  Num.  Name
    10    5     i:1   Loop blocked by 56 iterations
    10    5     i:1   Test on line 12 promoted out of loop
    10    6     i:1   Loop blocked by 56 iterations
    10    7     i:1   Loop blocked by 56 iterations
    12    8     j:1   Loop unrolled by 8 iterations and jammed into the
                      innermost loop
    12    9     j:1   Test on line 10 promoted out of loop
    21    14    j:2   Peeled last iteration of loop


The Optimization Report for example200 provides the following results:

• Several occurrences of variables noted

In this report, the Var Name column has entries such as i:1, j:1,
i:2, and j:2. This type of entry appears when a variable is used
more than once. In example200, i is used as an iteration variable
twice. Consequently, i:1 refers to the first occurrence, and i:2
refers to the second occurrence.

• Loop #1 creates new loops

The first line of the report shows that Loop #1, shown on line 10, is
distributed to create Loop #2 and Loop #3:

    10    1     i:1   *Dist            (2-3)

Initially, Loop #1 appears as shown.

    DO I = 1, N                                      ! Loop #1
      C(I) = 0
      DO J = 1, M
        A(I,J) = A(I,J) + B(I,J) * C(I)
      ENDDO
    ENDDO

It is then distributed as follows:

    DO I = 1, N                                      ! Loop #2
      C(I) = 0
    ENDDO

    DO I = 1, N                                      ! Loop #3
      DO J = 1, M
        A(I,J) = A(I,J) + B(I,J) * C(I)
      ENDDO
    ENDDO

• Loop #3 is interchanged to create Loop #4

The third line indicates this:

    10    3     i:1   *Interchange     (4)      (i:1 j:1) -> (j:1 i:1)

Now, the loop looks like the following code:

    DO J = 1, M                                      ! Loop #4
      DO I = 1, N
        A(I,J) = A(I,J) + B(I,J) * C(I)
      ENDDO
    ENDDO

• Nested loop is blocked

The next line of the Optimization Report indicates that the nest
rooted at Loop #4 is blocked:

    12    4     j:1   *Block           (5)      (j:1 i:1) -> (i:1 j:1 i:1)

The blocked nest internally appears as follows:

    DO IOUT = 1, N, 56                               ! Loop #5
      DO J = 1, M
        DO I = IOUT, IOUT + 55
          A(I,J) = A(I,J) + B(I,J) * C(I)
        ENDDO
      ENDDO
    ENDDO
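
The blocked nest can be sketched in C as follows. This is an illustration only: the function names are hypothetical, the arrays are stored so that a[j * n + i] plays the role of the Fortran A(I,J) (keeping i as the contiguous index), and the clamp on the inner loop's upper bound is an assumption added here so that a row count that is not a multiple of the block size is handled; the simplified Fortran listing above does not show how the compiler treats that remainder.

```c
#include <stddef.h>

/* Block size reported in the Analysis Table for this example. */
#define BLOCK 56

/* Interchanged nest before blocking: j outer, i inner. */
void update_plain(double *a, const double *b, const double *c,
                  size_t n, size_t m)
{
    for (size_t j = 0; j < m; j++)
        for (size_t i = 0; i < n; i++)
            a[j * n + i] += b[j * n + i] * c[i];
}

/* Blocked nest: the i loop is split into an outer loop over blocks
   (iout) and an inner loop over one block, with the j loop between
   them, giving the (i j i) order shown in the report. */
void update_blocked(double *a, const double *b, const double *c,
                    size_t n, size_t m)
{
    for (size_t iout = 0; iout < n; iout += BLOCK) {
        /* Clamp the block end so a partial last block stays in range. */
        size_t iend = (iout + BLOCK < n) ? iout + BLOCK : n;
        for (size_t j = 0; j < m; j++)
            for (size_t i = iout; i < iend; i++)
                a[j * n + i] += b[j * n + i] * c[i];
    }
}
```

Within one block the same BLOCK entries of c are reused for every j, so they are more likely to stay in cache than when the whole i range is traversed for each j.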
• Loop #5 noted as blocked

The loop with iteration variable i:1 is the loop that was actually
blocked. The report shows *Block on Loop #4 (the j:1 loop) because
the entire nest rooted at Loop #4 is replaced by the blocked nest.

• iout variable facilitates loop blocking

The iout variable is introduced to facilitate the loop blocking. The
compiler uses a step value of 56 for the iout loop, as reported in the
Analysis Table:

    10    5     i:1   Loop blocked by 56 iterations

• Test promotion creates new loops

The next three lines of the report show that a test was promoted out
of Loop #5, creating Loop #6 (which is removed) and Loop #7
(which is run serially). This test, which does not appear in the source
code, is an implicit test that the compiler inserts in the code to
ensure that the loop iterates at least once.

    10    5     i:1   *Promote         (6-7)
    10    6     i:1   Serial                    Removed
    10    7     i:1   Serial

This test is referenced again in the following line from the
Analysis Table:

    10    5     i:1   Test on line 12 promoted out of loop

• Unroll and jam creates new loop

The report indicates that the j loop is unrolled and jammed, creating
Loop #9:

    12    8     j:1   *Unroll And Jam  (9)

• j loop unrolled by 8 iterations

The corresponding Analysis Table line indicates that the j loop is
unrolled by 8 iterations and jammed:

    12    8     j:1   Loop unrolled by 8 iterations and jammed
                      into the innermost loop
The unrolled and jammed loop results in the following code:

    DO IOUT = 1, N, 56                               ! Loop #5
      DO J = 1, M, 8                                 ! Loop #8
        DO I = IOUT, IOUT + 55                       ! Loop #9
          A(I,J)   = A(I,J)   + B(I,J)   * C(I)
          A(I,J+1) = A(I,J+1) + B(I,J+1) * C(I)
          A(I,J+2) = A(I,J+2) + B(I,J+2) * C(I)
          A(I,J+3) = A(I,J+3) + B(I,J+3) * C(I)
          A(I,J+4) = A(I,J+4) + B(I,J+4) * C(I)
          A(I,J+5) = A(I,J+5) + B(I,J+5) * C(I)
          A(I,J+6) = A(I,J+6) + B(I,J+6) * C(I)
          A(I,J+7) = A(I,J+7) + B(I,J+7) * C(I)
        ENDDO
      ENDDO
    ENDDO



• Test promotion in Loop #9 creates new loops

The Optimization Report indicates that the compiler-inserted test in
Loop #9 is promoted out of the loop, creating Loop #10 and
Loop #11.

    12    9     j:1   *Promote         (10-11)
    12    10    j:1   Serial                    Removed
    12    11    j:1   Serial


• Loops are fused

According to the report, the last two loops in the program are fused
(once an iteration is peeled off the second loop), then the new loop is
run serially.

    17    13    i:2   Serial                    Fused
    21    14    j:2   *Peel            (15)
    21    15    j:2   Serial                    Fused
                      *Fused           (16)     (13 15) -> (16)
    17    16    i:2   Serial

That information is combined with the following line from the
Analysis Table:

    21    14    j:2   Peeled last iteration of loop

• Loop peeling creates loop, enables fusion

Initially, Loop #14 has an iteration peeled to create Loop #15, as
shown below. The loop peeling is performed to enable loop fusion:
after peeling, both loops have the same trip count.

    DO I = 1, N-1                                    ! Loop #13
      D(I) = I
    ENDDO

    DO J = 1, N-1                                    ! Loop #15
      E(J) = D(J) + 1
    ENDDO

• Loops are fused to create new loop

Loop #13 and Loop #15 are then fused to produce Loop #16:

    DO I = 1, N-1                                    ! Loop #16
      D(I) = I
      E(I) = D(I) + 1
    ENDDO
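
The peel-then-fuse sequence can be sketched in C. This is an illustration under stated assumptions rather than compiler output: the function names are hypothetical, indexing starts at 0, and the peeled last iteration is written out explicitly after the fused loop, which the Fortran listings above leave implicit. As in the original program, the first loop never assigns the last element of d, so the caller is assumed to have initialized it.

```c
#include <stddef.h>

/* Original form: the first loop runs n-1 times, the second n times,
   so the two loops cannot be fused directly. */
void fill_separate(double *d, double *e, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        d[i] = (double)(i + 1);
    for (size_t j = 0; j < n; j++)
        e[j] = d[j] + 1.0;
}

/* Peeled and fused form: the last iteration of the second loop is
   peeled off, leaving two loops with n-1 iterations each, which are
   then fused into one. The peeled iteration runs after the loop. */
void fill_peeled_fused(double *d, double *e, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        d[i] = (double)(i + 1);
        e[i] = d[i] + 1.0;
    }
    if (n > 0)
        e[n - 1] = d[n - 1] + 1.0;   /* peeled last iteration */
}
```

Running both versions on zero-initialized arrays and comparing e element by element confirms that peeling changes only the loop structure, not the results.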
Parallel programming
techniques


The HP compiler set provides programming techniques that allow you to
increase code efficiency while achieving three-tier parallelism. This
chapter describes the following programming techniques and
requirements for implementing low-overhead parallel programs:

• Parallelizing directives and pragmas 

• Parallelizing loops 

• Parallelizing tasks 

• Parallelizing regions 

• Reentrant compilation 

• Setting thread default stack size 

• Collecting parallel information 

The HP aC++ compiler does not support the pragmas described in this 
chapter. 
Parallelizing directives and pragmas

This section summarizes the directives and pragmas used to achieve
parallelization in the HP compilers. The directives and pragmas are
listed in the order in which they would typically be used within a given
program.


Table 28 Parallel directives and pragmas

Pragma / Directive, Level of parallelism, Description

prefer_parallel [(attribute_list)]
    Level of parallelism: Loop
    Requests parallelization of the immediately following loop,
    accepting attribute combinations for thread-parallelism,
    strip-length adjustment, and maximum number of threads. The
    compiler handles data privatization and does not parallelize the
    loop if it is not safe to do so.

loop_parallel [(attribute_list)]
    Level of parallelism: Loop
    Forces parallelization of the immediately following loop. Accepts
    attributes for thread-parallelism, strip-length adjustment, maximum
    number of threads, and ordered execution. Requires you to manually
    privatize loop data and synchronize data dependences.

parallel [(attribute_list)]
    Level of parallelism: Region
    Allows you to parallelize a single code region to run on multiple
    threads. Unlike the tasking directives, which run discrete sections
    of code in parallel, parallel and end_parallel run multiple copies
    of a single section. Accepts attribute combinations for
    thread-parallelism and maximum number of threads.
    Within a parallel region, loop directives (prefer_parallel,
    loop_parallel) and tasking directives (begin_tasks) may appear
    with the dist attribute.

end_parallel
    Level of parallelism: Region
    Signifies the end of a parallel region (see parallel).

begin_tasks (attribute_list)
    Level of parallelism: Task
    Defines the beginning of a series of tasks, allowing you to
    parallelize consecutive blocks of code. Accepts attribute
    combinations for thread-parallelism, ordered execution, maximum
    number of threads, and others.

next_task
    Level of parallelism: Task
    Starts a block of code following a begin_tasks block that will be
    executed as a parallel task.

end_tasks
    Level of parallelism: Task
    Terminates parallel tasks started by begin_tasks and next_task.

ordered_section (gate)
    Level of parallelism: Loop
    Allows you to isolate dependences within a loop so that code
    contained within the ordered section executes in iteration order.
    Only useful when used with loop_parallel (ordered).

critical_section [(gate)]
    Level of parallelism: Loop
    Allows you to isolate nonordered manipulations of a shared variable
    within a loop. Only one parallel thread can execute the code
    contained in the critical section at a time, eliminating possible
    contention.

end_critical_section
    Level of parallelism: Loop
    Identifies the end of a critical section (see critical_section).

reduction
    Level of parallelism: Loop
    Forces reduction analysis on a loop being manipulated by the
    loop_parallel directive. See "Reductions" on page 108.

sync_routine
    Level of parallelism: Loop or Task
    Must be used to identify synchronization functions that you call
    indirectly in your own routines. See "sync_routine" on page 250.
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Parallelizing loops 

The H P compilers automatically exploit loop parallelism in dependence- 
free loops. The prefer_parallel, loop_parallel, and parallel 
directives and pragmas allow you to increase parallelization 
opportunities and to manually control many aspects of parallelization 
using simple manual loop parallelization. 

The prefer_paraiiei and ioop_paraiiei directives and pragmas, 
apply to the immediately following loop. Data privatization is necessary 
when using ioop_paraiiei; this is achieved by using the 
ioop_private directive, discussed in "Data privatization," on 
page 217. M anual data privatization using memory classes is discussed 
in "Memory classes," on page 233 and "Parallel synchronization," on 
page 243. 

The parallel directives and pragmas should only be used on Fortran do 
and C for loops that have iteration counts that are determined prior to 
loop invocation at runtime. 
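The requirement that the iteration count be known before the loop starts can be made concrete with a small C sketch. The helper trip_count is a hypothetical name used only for illustration and is not part of any HP library. It computes, before the loop is entered, the number of iterations of a loop of the form for (i = lo; i <= hi; i += step). A loop whose exit test depends on values computed inside the loop body has no such closed form.

```c
/* Hypothetical helper, not part of any HP library: computes the number
 * of iterations of
 *     for (i = lo; i <= hi; i += step)
 * before the loop is entered. A non-positive step is treated as
 * "no iterations" to keep the sketch simple. */
long trip_count(long lo, long hi, long step)
{
    if (step <= 0 || hi < lo)
        return 0;               /* loop body never executes */
    return (hi - lo) / step + 1;
}
```

For example, trip_count(1, 100, 1) evaluates to 100, the iteration count of the Fortran loop DO I = 1, 100.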

prefer_parallel 

The prefer_parallel directive and pragma causes the compiler to
automatically parallelize the immediately following loop if it is free of
dependences and other parallelization inhibitors. The compiler
automatically privatizes any loop variables that must be privatized.
prefer_parallel requires less manual intervention. However, it is less
powerful than the loop_parallel directive and pragma.

See "prefer_parallel, loop_parallel attributes" on page 181
for a description of attributes for this directive.

prefer_parallel can also be used to indicate the preferred loop in a
nest to parallelize, as shown in the following Fortran code:

      DO J = 1, 100
C$DIR PREFER_PARALLEL
        DO I = 1, 100

        ENDDO
      ENDDO

In this code, prefer_parallel causes the compiler to
choose the innermost loop for parallelization, provided it is free of
dependences. prefer_parallel does not inhibit loop interchange.

The ordered attribute in a prefer_parallel directive is only useful if
the loop contains synchronized dependences. The ordered attribute is
most useful in the loop_parallel directive, described in the next
section.

loop_parallel 

The loop_parallel directive forces parallelization of the immediately
following loop. The compiler does not check for data dependences,
perform variable privatization, or perform parallelization analysis. You
must synchronize any dependences manually and manually privatize
loop data as necessary. loop_parallel defaults to thread
parallelization.

See "prefer_parallel, loop_parallel attributes" on page 181
for a description of attributes for this directive.

loop_parallel(ordered) is useful for manually parallelizing loops
that contain ordered dependences. This is described in "Parallel
synchronization," on page 243.

Parallelizing loops with calls

loop_parallel is useful for manually parallelizing loops containing
procedure calls.

This is shown in the following Fortran code: 

C$DIR LOOP_PARALLEL
      DO I = 1, N
        X(I) = FUNC(I)
      ENDDO

The call to FUNC in this loop would normally prevent it from
parallelizing. Before forcing parallelization, verify that FUNC has no
side effects. A function does not have side effects if:

• It does not modify its arguments.

• It does not modify the same memory location from one call to the
  next.

• It performs no I/O.

• It does not call any procedures that have side effects.

If FUNC does have side effects or is not reentrant, this loop may yield
wrong answers.

If you are sure that FUNC has no side effects and is compiled for
reentrancy (the default), this loop can be safely parallelized.

In some cases, global register allocation can interfere with the routine being
called. Refer to "Global register allocation (GRA)" on page 43 for more
information.
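A C counterpart of this loop might look like the following sketch. The names func and fill are illustrative only. The pragma follows the C form listed in Table 29, including the required ivar attribute. A compiler that does not recognize _CNX pragmas ignores the line and runs the loop serially, which produces the same result because the iterations are independent.

```c
/* func has no side effects: it reads only its argument, writes no
 * shared memory, performs no I/O, and calls no other procedures. */
static double func(int i)
{
    return 2.0 * i + 1.0;
}

/* Fills x[0..n-1]. Each iteration writes a different element of x,
 * so forcing parallelization with loop_parallel is safe here. */
void fill(double *x, int n)
{
    int i;
#pragma _CNX loop_parallel(ivar = i)
    for (i = 0; i < n; i++)
        x[i] = func(i);
}
```

If func instead updated a file-scope variable or wrote to a file, the same pragma would still force parallelization, and the results could differ from run to run.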

Unparallelizable loops 

The compiler does not parallelize any loop whose number of iterations
cannot be determined prior to loop invocation at execution time, even
when loop_parallel is specified.

This is shown in the following Fortran code:

C$DIR LOOP_PARALLEL
      DO WHILE(A(I) .GT. 0)   ! WILL NOT PARALLELIZE

        A(I) =

      ENDDO

In general, there is no way the compiler can determine the loop's
iteration count prior to loop invocation here, so the loop cannot be
parallelized.

prefer_parallel, loop_parallel attributes

The prefer_parallel and loop_parallel directives and pragmas
accept the same attributes. The forms of these directives and pragmas
are shown in Table 29.

Table 29 Forms of prefer_parallel and loop_parallel directives and
pragmas

Language    Form

Fortran     C$DIR PREFER_PARALLEL [(attribute-list)]
            C$DIR LOOP_PARALLEL [(attribute-list)]

C           #pragma _CNX prefer_parallel [(attribute-list)]
            #pragma _CNX loop_parallel(ivar = indvar[, attribute-list])

where

ivar = indvar
            specifies that the primary loop induction variable is
            indvar. ivar = indvar is optional in Fortran, but
            required in C. Use only with loop_parallel.

attribute-list
            can contain one of the case-insensitive attributes noted
            in Table 30.

NOTE The values of n and m must be compile-time constants for the loop
parallelization attributes in which they appear.

Table 30 Attributes for loop_parallel, prefer_parallel


Attribute                   Description

dist                        Causes the compiler to distribute the iterations of a
                            loop across active threads instead of spawning new
                            threads. This significantly reduces parallelization
                            overhead.

                            Must be used with prefer_parallel or loop_parallel
                            inside a parallel/end_parallel region.

                            Can be used with any prefer_parallel or
                            loop_parallel attribute, except threads.

ordered                     Causes the iterations of the loop to be initiated in
                            iteration order across the processors. This is useful
                            only in loops with manually synchronized dependences,
                            constructed using loop_parallel.

                            To achieve ordered parallelism, dependences must be
                            synchronized within ordered sections, constructed
                            using the ordered_section and end_ordered_section
                            directives.

max_threads = m             Restricts execution of the specified loop to no more
                            than m threads if specified alone. m must be an
                            integer constant.

                            max_threads = m is useful when you know the maximum
                            number of threads your loop runs on efficiently.

                            If specified with the chunk_size = n attribute, the
                            chunks are parallelized across no more than m threads.

chunk_size = n              Divides the loop into chunks of n or fewer iterations
                            by which to strip mine the loop for parallelization.
                            n must be an integer constant.

                            If chunk_size = n is present alone, n or fewer loop
                            iterations are distributed round-robin to each
                            available thread until there are no remaining
                            iterations. This is shown in Table 32 and Table 33
                            on page 186.

                            If the number of threads does not evenly divide the
                            number of iterations, some threads perform one less
                            chunk than others.

dist, ordered               Causes ordered invocation of each iteration across
                            existing threads.

dist, max_threads = m       Causes thread-parallelism on no more than m existing
                            threads.

ordered, max_threads = m    Causes ordered parallelism on no more than m threads.

dist, chunk_size = n        Causes thread-parallelism by chunks.

dist, ordered,              Causes ordered thread-parallelism on no more than m
max_threads = m             existing threads.

chunk_size = n,             Causes chunk parallelism on no more than m threads.
max_threads = m

dist, chunk_size = n,       Causes thread-parallelism by chunks on no more than
max_threads = m             m existing threads.


Any loop under the influence of loop_parallel(dist) or
prefer_parallel(dist) appears in the Optimization Report as serial.
This is because it is already inside a parallel region. You can generate an
Optimization Report by specifying the +Oreport option. For more
information, see "Optimization Report," on page 151.

Combining the attributes

Table 30 describes the acceptable combinations of
loop_parallel and prefer_parallel attributes. In such
combinations, the attributes can be listed in any order.

The loop_parallel C pragma requires the ivar = indvar attribute,
which specifies the primary loop induction variable. If this is not present,
the compiler issues a warning and ignores the pragma. ivar should
specify only the primary induction variable. Any other loop induction
variables should be a function of this variable and should be declared
loop_private.

In Fortran, ivar is optional for DO loops. If it is not provided, the
compiler picks the primary induction variable for the loop. ivar is
required for DO WHILE and customized loops in Fortran.

prefer_parallel does not require ivar. The compiler issues an error if
it encounters this combination.
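The rule about secondary induction variables can be illustrated with a short C sketch. The function name gather_even is hypothetical. The loop_private(j) clause on the pragma line is an assumed spelling, written by analogy with the Fortran form C$DIR LOOP_PARALLEL, LOOP_PRIVATE(FUNCTEMP) used elsewhere in this chapter; check the loop_private documentation for the exact C syntax.

```c
/* Copies every other element of src into dst: dst[i] = src[2*i].
 * i is the primary induction variable named by ivar. The secondary
 * variable j is recomputed from i in every iteration instead of being
 * carried from one iteration to the next (no j += 2), so iterations
 * do not depend on each other. loop_private(j) is an assumed spelling. */
void gather_even(double *dst, const double *src, int n)
{
    int i, j;
#pragma _CNX loop_parallel(ivar = i), loop_private(j)
    for (i = 0; i < n; i++) {
        j = 2 * i;              /* function of the primary variable */
        dst[i] = src[j];
    }
}
```

Writing the loop with j incremented at the end of each iteration would give the same serial result, but each iteration would then depend on the value of j left by the previous one.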

Comparing prefer_parallel, loop_parallel

The prefer_parallel and loop_parallel directives and pragmas
are used to parallelize loops. Table 31 provides an overview of the
differences between the two pragmas/directives. See the sections
"prefer_parallel" on page 178 and "loop_parallel" on page 179 for
more information.

Table 31 Comparison of loop_parallel and prefer_parallel

               prefer_parallel                       loop_parallel

Description    Requests the compiler to perform      Forces the compiler to parallelize
               parallelization analysis on the       the following loop, assuming the
               following loop, then parallelize      iteration count can be determined
               the loop if it is safe to do so.      prior to loop invocation.

               When used with the +Oautopar
               option (the default), it overrides
               the compiler heuristic for picking
               which loop in a loop nest to
               parallelize.

               When used with +Onoautopar,
               the compiler only performs
               directive-specified parallelization.
               No heuristic is used to pick the
               loop in a nest to parallelize. In
               such cases, prefer_parallel
               requests loop parallelization.

Advantages     Compiler automatically performs       Allows you to parallelize loops
               parallelization analysis and          that the compiler is not able to
               variable privatization.               automatically parallelize because
                                                     it cannot determine dependences
                                                     or side effects.

Disadvantages  Loop may or may not execute in        Requires you to:
               parallel.                             - Check for and synchronize any
                                                       data dependences
                                                     - Perform variable privatization

Stride-based parallelism

Stride-based parallelism differs from the default strip-based parallelism
in that:

• Strip-based parallelism divides the loop's iterations into a number of
  contiguous chunks equal to the number of available threads, and each
  thread computes one chunk.

• Stride-based parallelism, set by the chunk_size = n attribute, allows
  each thread to do several noncontiguous chunks.

Specifying chunk_size = ((number of iterations - 1) / number of
threads) + 1 is similar to default strip mining for parallelization.
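Both ideas can be modeled with a few lines of C. This is a sketch under stated assumptions, not a description of the compiler's internals: the function names are hypothetical, iterations are numbered from 1 as in the Fortran examples, and chunks are assumed to be dealt to threads round-robin starting with thread 0, which is consistent with Table 32, Table 33, and Figure 16.

```c
/* Chunk size that makes stride-based parallelism resemble default
 * strip mining: ((iterations - 1) / threads) + 1, using integer
 * division. Assumes iterations >= 1 and threads >= 1. */
int strip_like_chunk_size(int iterations, int threads)
{
    return (iterations - 1) / threads + 1;
}

/* Thread that executes a given 1-based iteration when chunks of
 * chunk_size iterations are dealt round-robin, starting at thread 0. */
int chunk_owner(int iteration, int chunk_size, int threads)
{
    int chunk_index = (iteration - 1) / chunk_size;  /* 0-based chunk */
    return chunk_index % threads;
}
```

For 1000 iterations on 4 threads, strip_like_chunk_size gives 250, so each thread receives a single contiguous chunk. Because chunk_size must be a compile-time constant, such a value is computed by hand and written into the directive.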

Using chunk_size = 1 distributes individual iterations cyclically
across the processors. For example, if a loop has 1000 iterations to be
distributed among 4 processors, specifying chunk_size = 1 causes the
distribution shown in Table 32.

Table 32 Iteration distribution using chunk_size = 1

             CPU0    CPU1    CPU2    CPU3

Iterations   1       2       3       4
             5       6       7

For chunk_size = n, with n > 1, the distribution is round-robin. However,
it is not the same as specifying the ordered attribute. For example,
using the same loop as above, specifying chunk_size = 5 produces the
distribution shown in Table 33.

Table 33 Iteration distribution using chunk_size = 5

             CPU0                 CPU1                 CPU2                 CPU3

Iterations   1, 2, 3, 4, 5        6, 7, 8, 9, 10       11, 12, 13, 14, 15   16, 17, 18, 19, 20
             21, 22, 23, 24, 25   26, 27, 28, 29, 30   31, 32, 33, 34, 35,

For more information and examples on using the chunk_size = n
attribute, see "Troubleshooting," on page 273.

Example

prefer_parallel, loop_parallel

The following Fortran example uses the prefer_parallel directive,
but applies to loop_parallel as well:

C$DIR PREFER_PARALLEL(CHUNK_SIZE = 4)
      DO I = 1, 100
        A(I) = B(I) + C(I)
      ENDDO

In this example, the loop is parallelized by parcelling out chunks of four
iterations to each available thread. Figure 16 uses Fortran array syntax
to illustrate the iterations performed by each thread, assuming eight
available threads.

Figure 16 shows that the 100 iterations of I are parcelled out in chunks
of four iterations to each of the eight available threads. After the chunks
are distributed evenly to all threads, there is one chunk left over
(iterations 97:100), which executes on thread 0.

Figure 16 Stride-parallelized loop


THREAD 0    A(1:4)=B(1:4)+C(1:4)
            ...
            A(65:68)=B(65:68)+C(65:68)
            A(97:100)=B(97:100)+C(97:100)

THREAD 1    A(5:8)=B(5:8)+C(5:8)
            ...
            A(69:72)=B(69:72)+C(69:72)

THREAD 2    A(9:12)=B(9:12)+C(9:12)
            ...
            A(73:76)=B(73:76)+C(73:76)

THREAD 3    A(13:16)=B(13:16)+C(13:16)
            ...
            A(77:80)=B(77:80)+C(77:80)

THREAD 4    A(17:20)=B(17:20)+C(17:20)
            ...
            A(81:84)=B(81:84)+C(81:84)

THREAD 5    A(21:24)=B(21:24)+C(21:24)
            ...
            A(85:88)=B(85:88)+C(85:88)

THREAD 6    A(25:28)=B(25:28)+C(25:28)
            ...
            A(89:92)=B(89:92)+C(89:92)

THREAD 7    A(29:32)=B(29:32)+C(29:32)
            ...
            A(93:96)=B(93:96)+C(93:96)

prefer_parallel, loop_parallel

The chunk_size = n attribute is most useful on loops in which the
amount of work increases or decreases as a function of the iteration
count. These loops are also known as triangular loops. The following
Fortran example shows such a loop. As with the previous example,
prefer_parallel is used here, but the concept also applies to
loop_parallel.

C$DIR PREFER_PARALLEL(CHUNK_SIZE = 4)
      DO J = 1, N
        DO I = J, N
          A(I, J) = ...

        ENDDO
      ENDDO

Here, the work of the I loop decreases as J increases. By specifying a
chunk_size for the J loop, the load is more evenly balanced across the
threads executing the loop.
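The load-balancing effect can be checked with a small C model. This is only a sketch: the function names are hypothetical, the unit of work is one inner-loop iteration, contiguous strips are assumed to be of equal length, and chunks are assumed to be dealt round-robin starting with thread 0.

```c
/* Model of the triangular loop DO J = 1, N ; DO I = J, N.
 * Outer iteration j performs n - j + 1 inner iterations. */
static long work_of(int j, int n)
{
    return n - j + 1;
}

/* Inner iterations executed by thread tid when the outer loop is cut
 * into one contiguous strip per thread. Assumes threads divides n. */
long strip_work(int tid, int n, int threads)
{
    int per_thread = n / threads;
    long total = 0;
    int j;
    for (j = tid * per_thread + 1; j <= (tid + 1) * per_thread; j++)
        total += work_of(j, n);
    return total;
}

/* Inner iterations executed by thread tid when chunks of chunk outer
 * iterations are dealt round-robin, starting with thread 0. */
long chunk_work(int tid, int n, int threads, int chunk)
{
    long total = 0;
    int j;
    for (j = 1; j <= n; j++)
        if (((j - 1) / chunk) % threads == tid)
            total += work_of(j, n);
    return total;
}
```

With N = 8 and two threads, contiguous strips give the threads 26 and 10 inner iterations, while a chunk size of 1 gives 20 and 16.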

If this loop were strip-mined in the traditional manner, the amount of
work contained in the strips would decrease with each successive strip.
The threads performing early iterations of J would do substantially more
work than those performing later iterations.

critical_section, end_critical_section

The critical_section and end_critical_section directives and
pragmas allow you to specify sections of code in parallel loops or tasks
that must be executed by only one thread at a time. These directives
cannot be used for ordered synchronization within a
loop_parallel(ordered) loop, but are suitable for simple
synchronization in any other loop_parallel loops. Use the
ordered_section and end_ordered_section directives or pragmas
for ordered synchronization within a loop_parallel(ordered) loop.

A critical_section directive or pragma and its associated
end_critical_section must appear in the same procedure and under
the same control flow. They do not have to appear in the same procedure
as the parallel construct in which they are used. For instance, the pair
can appear in a procedure called from a parallel loop.

The forms of these directives and pragmas are shown in Table 34.

Table 34 Forms of critical_section/end_critical_section directives and
pragmas

Language    Form

Fortran     C$DIR CRITICAL_SECTION [(gate)]
            C$DIR END_CRITICAL_SECTION

C           #pragma _CNX critical_section [(gate)]
            #pragma _CNX end_critical_section

The critical_section directive/pragma can take an optional gate
attribute that allows the declaration of multiple critical sections. This is
described in "Using gates and barriers" on page 245. Only simple critical
sections are discussed in this section.

critical_section

Consider the following Fortran example:

C$DIR LOOP_PARALLEL, LOOP_PRIVATE(FUNCTEMP)
      DO I = 1, N     ! LOOP IS PARALLELIZABLE

        FUNCTEMP = FUNC(X(I))
C$DIR CRITICAL_SECTION
        SUM = SUM + FUNCTEMP
C$DIR END_CRITICAL_SECTION

      ENDDO

Because FUNC has no side effects and is called in parallel, the I loop can
be parallelized as long as the SUM variable is only updated by one thread at
a time. The critical section created around SUM ensures this behavior.

The loop_parallel directive and the critical section directive are
required to parallelize this loop because the call to FUNC would normally
inhibit parallelization. If this call were not present, and if the loop did
not contain other parallelization inhibitors, the compiler would
automatically parallelize the reduction of SUM as described in the section
"Reductions" on page 108. However, the presence of the call necessitates
the loop_parallel directive, which prevents the compiler from
automatically handling the reduction.

This, in turn, requires using either a critical section directive or the
reduction directive. Placing the call to FUNC outside of the critical
section allows FUNC to be called in parallel, decreasing the amount of
serial work within the critical section.
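A C rendering of the same pattern might look like the sketch below. The critical_section and end_critical_section pragmas follow the C forms in Table 34. The names func and sum_of_func are illustrative, and the loop_private(functemp) clause on the loop_parallel line is an assumed C spelling that mirrors the Fortran directive above. A compiler that does not recognize _CNX pragmas ignores them and runs the loop serially.

```c
/* Side-effect-free function: safe to call from several threads. */
static double func(double v)
{
    return v * v;
}

/* Sums func(x[i]) over x[0..n-1]. Only the update of sum is inside
 * the critical section; the call to func stays outside it so that the
 * expensive part of each iteration can run in parallel. */
double sum_of_func(const double *x, int n)
{
    double sum = 0.0;
    double functemp;
    int i;
#pragma _CNX loop_parallel(ivar = i), loop_private(functemp)
    for (i = 0; i < n; i++) {
        functemp = func(x[i]);      /* parallel work */
#pragma _CNX critical_section
        sum = sum + functemp;       /* one thread at a time */
#pragma _CNX end_critical_section
    }
    return sum;
}
```

Moving the call to func inside the critical section would still give a correct sum, but it would serialize nearly all of the loop's work.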

To justify the cost of the compiler-generated synchronization
code associated with the use of critical sections, loops that contain them
must also contain a large amount of parallelizable (non-critical section)
code. If you are unsure of the profitability of using a critical section to
help parallelize a certain loop, time the loop with and without the critical
section. This helps to determine if parallelization justifies the overhead
of the critical section.

For this particular example, the reduction directive or pragma could
have been used in place of the critical_section,
end_critical_section combination. For more information, see the
section "Reductions" on page 108.

Disabling automatic loop thread-parallelization

You can disable automatic loop thread-parallelization by specifying the
+Onoautopar option on the compiler command line. +Onoautopar is
only meaningful when specified with the +Oparallel option at +O3
or +O4.

This option causes the compiler to parallelize only those loops that are
immediately preceded by prefer_parallel or loop_parallel.
Because the compiler does not automatically find parallel tasks or
regions, user-specified task and region parallelization is not affected by
this option.

Parallelizing tasks

The compiler does not automatically parallelize code outside a loop.
However, you can use tasking directives and pragmas to instruct the
compiler to parallelize this type of code.

• The begin_tasks directive and pragma tells the compiler to begin
  parallelizing a series of tasks.

• The next_task directive and pragma marks the end of a task and
  the start of the next task.

• The end_tasks directive and pragma marks the end of a series of
  tasks to be parallelized and prevents execution from continuing until
  all tasks have completed.

The sections of code delimited by these directives are referred to as a
task list. Within a task list, the compiler does not check for data
dependences, perform variable privatization, or perform parallelization
analysis. You must manually synchronize any dependences between
tasks and manually privatize data as necessary.

The forms of these directives and pragmas are shown in Table 35.

Table 35 Forms of task parallelization directives and pragmas

Language    Form

Fortran     C$DIR BEGIN_TASKS [(attribute-list)]
            C$DIR NEXT_TASK
            C$DIR END_TASKS

C           #pragma _CNX begin_tasks [(attribute-list)]
            #pragma _CNX next_task
            #pragma _CNX end_tasks

where

attribute-list
            can contain one of the case-insensitive attributes noted
            in Table 36.

The optional attribute-list can contain one of the following attribute
combinations, with m being an integer constant.


Table 36 Attributes for task parallelization

Attribute                   Description

dist                        Instructs the compiler to distribute the tasks across
                            the currently active threads, instead of spawning new
                            threads.

                            Use with other valid attributes to begin_tasks inside
                            a parallel/end_parallel region. begin_tasks and
                            parallel/end_parallel must appear inside the same
                            function.

ordered                     Causes the tasks to be initiated in their lexical
                            order. That is, the first task in the sequence begins
                            to run on its respective thread before the second,
                            and so on.

                            In the absence of the ordered argument, the starting
                            order is indeterminate. While this argument ensures
                            an ordered starting sequence, it does not provide any
                            synchronization between tasks, and does not guarantee
                            any particular ending order.

                            You can manually synchronize the tasks, if necessary,
                            as described in "Parallel synchronization," on
                            page 243.

max_threads = m             Restricts execution of the specified tasks to no more
                            than m threads if specified alone or with the threads
                            attribute. m must be an integer constant.

                            max_threads = m is useful when you know the maximum
                            number of threads on which your task runs efficiently.

                            Can include any combination of thread-parallel,
                            ordered, or unordered execution.

dist, ordered               Causes ordered invocation of each task across threads,
                            as specified in the attribute list to the parallel
                            directive.

dist, max_threads = m       Causes thread-parallelism on no more than m existing
                            threads.

ordered, max_threads = m    Causes ordered parallelism on no more than m threads.

dist, ordered,              Causes ordered thread-parallelism on no more than m
max_threads = m             existing threads.

NOTE Do not use tasking directives or pragmas unless you have verified that
dependences do not exist. You may insert your own synchronization code in
the code delimited by the tasking directives or pragmas. The compiler does
not perform dependence checking or synchronization on the code in these
regions. Synchronization is discussed in "Parallel synchronization," on
page 243.

Example

Parallelizing tasks

The following Fortran example shows how to insert tasking directives
into a section of code containing three tasks that can be run in parallel:

C$DIR BEGIN_TASKS
      parallel task 1
C$DIR NEXT_TASK
      parallel task 2
C$DIR NEXT_TASK
      parallel task 3
C$DIR END_TASKS

The example above specifies thread-parallelism by default. The compiler
transforms the code into a parallel loop and creates machine code
equivalent to the following Fortran code:

C$DIR LOOP_PARALLEL
      DO 40 I = 1,3
        GOTO (10,20,30) I
 10     parallel task 1
        GOTO 40
 20     parallel task 2
        GOTO 40
 30     parallel task 3
        GOTO 40
 40   CONTINUE

If there are more tasks than available threads, some threads execute
multiple tasks. If there are more threads than tasks, some threads do not
execute tasks.
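The loop-and-dispatch transformation described above can be modeled in C, with a switch statement playing the role of the computed GOTO. This is an illustrative sketch only: the task functions and run_tasks are hypothetical names, and the compiler's generated code is not claimed to look like this.

```c
/* Three independent tasks. Each writes a different element of out,
 * so no synchronization between them is needed. */
static void task1(int *out) { out[0] = 10; }
static void task2(int *out) { out[1] = 20; }
static void task3(int *out) { out[2] = 30; }

/* One loop iteration per task. Parallelizing the loop then runs the
 * tasks on separate threads; run serially, it executes them in order. */
void run_tasks(int *out)
{
    int i;
#pragma _CNX loop_parallel(ivar = i)
    for (i = 1; i <= 3; i++) {
        switch (i) {
        case 1: task1(out); break;
        case 2: task2(out); break;
        case 3: task3(out); break;
        }
    }
}
```

Seen this way, the remarks about thread counts follow directly: three loop iterations are shared among however many threads are available.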

In the Fortran example above, the end_tasks directive and pragma acts as
a barrier. All parallel tasks must complete before the code following
end_tasks can execute.

Parallelizing tasks

The following C example illustrates how to use these directives to specify
simple task-parallelization:

#pragma _CNX begin_tasks, task_private(i)
  for (i = 0; i < n-1; i++)
    a[i] = a[i+1] + b[i];
#pragma _CNX next_task
  tsub(x, y);
#pragma _CNX next_task
  for (i = 0; i < 500; i++)
    c[i*2] = d[i];
#pragma _CNX end_tasks

In this example, one thread executes the first for loop, another thread
executes the tsub(x,y) function call, and a third thread assigns the
elements of the array d to every other element of c. These threads
execute in parallel, but their starting and ending orders are
indeterminate.

The tasks are thread-parallelized. This means that there is no room for
nested parallelization within the individual parallel tasks of this
example, so the forward LCD on the for i loop is inconsequential. There
is no way for the loop to run but serially.

The loop induction variable i must be manually privatized here because
it is used to control loops in two different tasks. If i were not private,
both tasks would modify it, causing wrong answers. The task_private
directive and pragma is described in detail in the section
"task_private" on page 227.

Task parallelism can become even more involved, as described in
"Parallel synchronization," on page 243.

Parallelizing regions

A parallel region is a single block of code that is written to run replicated
on several threads. Certain scalar code within the parallel region is run
by each thread in preparation for work-sharing parallel constructs such
as prefer_parallel(dist), loop_parallel(dist), or
begin_tasks(dist). The scalar code typically assigns data into
parallel_private variables so that subsequent references to the data
have a high cache hit rate. Within a parallel region, code execution can
be restricted to subsets of threads by using conditional blocks that test
the thread ID.

Region parallelization differs from task parallelization in that parallel
tasks are separate, contiguous blocks of code. When parallelized using
the tasking directives and pragmas, each block generally runs on a
separate thread. This is in comparison to a single parallel region, which
runs on several threads.

Specifying parallel tasks is also typically less time consuming because
each thread's work is implicitly defined by the task boundaries. In region
parallelization, you must manually modify the region to identify
thread-specific code. However, region parallelism can reduce
parallelization overhead as discussed in the section explaining the dist
attribute.

The beginning of a parallel region is denoted by the parallel directive
or pragma. The end is denoted by the end_parallel directive or
pragma. end_parallel also prevents execution from continuing until
all copies of the parallel region have completed.

Within a parallel region, the compiler does not check for data
dependences, perform variable privatization, or perform parallelization
analysis. You must manually synchronize any dependences between
copies of the region and manually privatize data as necessary. In the
absence of a threads attribute, parallel defaults to thread
parallelization.

The forms of the region parallelization directives and pragmas are
shown in Table 37.

Table 37 Forms of region parallelization directives and pragmas

Language    Form

Fortran     C$DIR PARALLEL [(attribute-list)]
            C$DIR END_PARALLEL

C           #pragma _CNX parallel (attribute-list)
            #pragma _CNX end_parallel

The optional attribute-list can contain one of the following attributes (m
is an integer constant).


Table 38 Attributes for region parallelization

Attribute                   Description

max_threads = m             Restricts execution of the specified region to no more
                            than m threads if specified alone or with the threads
                            attribute. m must be an integer constant.

                            Can include any combination of ordered or unordered
                            execution.

WARNING Do not use the parallel region directives or pragmas unless you ensure that
dependences do not exist or you insert your own synchronization code, if
necessary, in the region. The compiler performs no dependence checking or
synchronization on the code delimited by the parallel region directives and
pragmas. Synchronization is discussed in "Parallel synchronization," on
page 243.

Example

Region parallelization

The following Fortran example provides an implementation of region
parallelization using the parallel directive:

      REAL A(1000,8), B(1000,8), C(1000,8), RDONLY(1000), SUM(8)
      INTEGER MYTID

C     FIRST INITIALIZATION OF RDONLY IN SERIAL CODE:
      CALL INIT1(RDONLY)

      IF(NUM_THREADS() .LT. 8) STOP "NOT ENOUGH THREADS; EXITING"

C$DIR PARALLEL(MAX_THREADS = 8), PARALLEL_PRIVATE(I, J, K, MYTID)
      MYTID = MY_THREAD() + 1     ! ADD 1 FOR PROPER SUBSCRIPTING
      DO I = 1, 1000
        A(I, MYTID) = B(I, MYTID) * RDONLY(I)
      ENDDO

      IF(MYTID .EQ. 1) THEN       ! ONLY THREAD 0 EXECUTES SECOND
        CALL INIT2(RDONLY)        ! INITIALIZATION
      ENDIF

      DO J = 1, 1000
        B(J, MYTID) = B(J, MYTID) * RDONLY(J)
        C(J, MYTID) = A(J, MYTID) * B(J, MYTID)
      ENDDO

      DO K = 1, 1000
        SUM(MYTID) = SUM(MYTID) + A(K,MYTID) + B(K,MYTID) +
     &    C(K,MYTID)
      ENDDO
C$DIR END_PARALLEL

In this example, all arrays written to in the parallel code have a
dimension sized to the anticipated number of parallel threads. Each
thread works on disjoint data, so there is no chance of two threads
attempting to update the same element and, therefore, no need
for explicit synchronization. The RDONLY array is one-dimensional, but it
is never written to by parallel threads. Before the parallel region,
RDONLY is initialized in serial code.

The parallel_private directive is used to privatize the induction
variables used in the parallel region. This must be done so that the
various threads processing the region do not attempt to write to the same
shared induction variables. parallel_private is covered in more
detail in the section "parallel_private" on page 229.

Just before the parallel region, the NUM_THREADS() intrinsic is
called to ensure that the expected number of threads are available. Then
the MY_THREAD() intrinsic is called by each thread to determine its
thread ID. All subsequent code in the region is executed based on this ID.
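The idea of indexing shared arrays by thread ID can be modeled in plain C without any HP intrinsics. In this sketch, which is an illustration rather than a translation of the Fortran example, the thread ID is passed in as a parameter instead of being obtained from an intrinsic, the array sizes are shrunk, and a serial loop stands in for the replicated execution of the region. All names are hypothetical.

```c
#define NTHREADS 8
#define LEN 4

/* Body of a hypothetical parallel region. tid stands in for the value
 * a thread would obtain from a thread-ID intrinsic. Each copy of the
 * body touches only column tid of a and element tid of sum, so copies
 * running at the same time never write the same location. */
void region_body(int tid, double a[LEN][NTHREADS],
                 double b[LEN][NTHREADS],
                 const double rdonly[LEN], double sum[NTHREADS])
{
    int i;
    for (i = 0; i < LEN; i++)
        a[i][tid] = b[i][tid] * rdonly[i];
    sum[tid] = 0.0;
    for (i = 0; i < LEN; i++)
        sum[tid] += a[i][tid];
}

/* Serial stand-in for replicated execution: runs the body once per
 * thread ID. */
void run_region(double a[LEN][NTHREADS], double b[LEN][NTHREADS],
                const double rdonly[LEN], double sum[NTHREADS])
{
    int tid;
    for (tid = 0; tid < NTHREADS; tid++)
        region_body(tid, a, b, rdonly, sum);
}
```

Because every write is indexed by tid, the order in which the copies of the body run does not affect the result.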

In the I loop, each thread computes one column of A using RDONLY and the
corresponding column of B.

RDONLY is reinitialized in a subroutine call that is only executed by
thread 0 before it is used again in the computation of B in the J loop. In
the J loop, each thread again computes a column of B, and the same loop
computes the corresponding column of C.

Finally, the K loop sums each thread's column of A, B, and C into the SUM
array. No synchronization is necessary here because each thread is running
the entire loop serially and assigning into a discrete element of SUM.

Reentrant compilation

By default, HP-UX parallel compilers compile for reentrancy in that the
compiler itself does not introduce static or global references beyond what
exist in the original code. Reentrant compilation causes procedures to
store uninitialized local variables on the stack. No locals can carry values
from one invocation of the procedure to the next, unless the variables
appear in Fortran COMMON blocks or DATA or SAVE statements or in C/C++
static statements. This allows loops containing procedure calls to
be manually parallelized, assuming no other inhibitors of parallelization
exist.
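The difference between a local that carries a value across calls and one that does not can be seen in a minimal C sketch. Both function names are hypothetical.

```c
/* Not safe to call in parallel: the static local keeps its value
 * between calls, so all threads would share one counter and could
 * update it at the same time. */
int next_id_static(void)
{
    static int counter = 0;
    counter = counter + 1;
    return counter;
}

/* Reentrant: all state lives in the argument and in a local on the
 * caller's stack. Nothing is carried from one call to the next. */
int next_id_reentrant(int previous)
{
    int result = previous + 1;
    return result;
}
```

Two successive calls to next_id_static return different values, while two calls to next_id_reentrant with the same argument always return the same value. The second behavior is what makes a procedure safe to call from a manually parallelized loop.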

When procedures arecalled in parallel, each thread receives a private 
stack on which to allocate local variables. This allows each parallel copy 
of the procedure to manipulate its local variables without interfering 
with any other copy's locals of the same name. When the procedure 
returns and the parallel threads join, all values on the stack are lost. 
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Setting thread default stack size 


Setting thread default stack size 

Thread 0's stack can grow to the size specified in the maxssiz
configurable kernel parameter. Refer to the Managing Systems and
Workgroups manual for more information on configurable kernel
parameters.

Any threads your program spawns (as the result of loop_parallel or
tasking directives or pragmas) receive a default stack size of 80 Mbytes.
This means that if any of the following conditions exists, you must
modify the stack size of the spawned threads using the CPS_STACK_SIZE
environment variable:

• A parallel construct declares more than 80 Mbytes of loop_private,
task_private, or parallel_private data, or

• A subprogram with more than 80 Mbytes of local data is called in
parallel, or

• The cumulative size of all local variables in a chain of subprograms
called in parallel exceeds 80 Mbytes.

Modifying thread stack size 

Under csh, you can modify the stack size of the spawned threads using
the CPS_STACK_SIZE environment variable.

The form of the CPS_STACK_SIZE environment variable is shown in
Table 39.


Table 39    Forms of CPS_STACK_SIZE environment variable

    Language      Form
    Fortran, C    setenv CPS_STACK_SIZE size_in_kbytes

where

size_in_kbytes
              is the desired stack size in kbytes. This value is read at
              program start-up, and it cannot be changed during
              execution.
For example, the following command sets the thread stack size
to 100 Mbytes:

setenv CPS_STACK_SIZE 102400 
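
The arithmetic behind that value can be sketched in C. This is an
illustrative helper only: the names bytes_to_kbytes and
required_stack_kbytes are hypothetical, and the sketch assumes
1 kbyte = 1024 bytes, which is consistent with 100 Mbytes being written
as 102400 above. It counts only the private data you pass in, so the
result is best read as a lower bound rather than an exact requirement.

```c
#include <stddef.h>

/* 80 Mbytes expressed in kbytes: the default spawned-thread stack size
 * stated in the text. */
#define DEFAULT_SPAWNED_STACK_KBYTES (80UL * 1024UL)

/* Round a size in bytes up to whole kbytes (1 kbyte = 1024 bytes). */
unsigned long bytes_to_kbytes(unsigned long long nbytes)
{
    return (unsigned long)((nbytes + 1023ULL) / 1024ULL);
}

/* Return the CPS_STACK_SIZE value (in kbytes) needed for private_bytes of
 * thread-private stack data, or 0 if the 80 Mbyte default suffices. */
unsigned long required_stack_kbytes(unsigned long long private_bytes)
{
    unsigned long need = bytes_to_kbytes(private_bytes);
    return (need > DEFAULT_SPAWNED_STACK_KBYTES) ? need : 0UL;
}
```

For 100 Mbytes of private data the second function returns 102400, the
value used in the setenv command above.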


Collecting parallel information 

Several intrinsics are available to provide information regarding the
parallelism or potential parallelism of your program. These are all
integer functions, available in both 4- and 8-byte variants. They can
appear in executable statements anywhere an integer expression is
allowed.

The 8-byte functions, which are suffixed with _8, are typically only used
in Fortran programs in which the default data lengths have been
changed using the -is or similar compiler options. When default integer
lengths are modified via compiler options in Fortran, the correct
intrinsic is automatically chosen regardless of which is specified. These
versions expect 8-byte input arguments and return 8-byte values.

All C/C++ code examples presented in this chapter assume that the line 
below appears above the C code presented. This header file contains the 
necessary type and function definitions. 

#include <spp_prog_model.h> 

Number of processors 

Certain functions return the total number of processors on which the 
process has initiated threads. These threads are not necessarily active at 
the time of the call. The forms of these functions are shown in Table 40. 

Table 40    Number of processors functions

    Language    Form
    Fortran     INTEGER NUM_PROCS()
                INTEGER*8 NUM_PROCS_8()
    C/C++       int num_procs(void);
                long long num_procs_8(void);

num_procs is used to dimension automatic and adjustable arrays in 
Fortran. It may be used in Fortran, C, and C++ to dynamically specify 
array dimensions and allocate storage. 
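
A minimal C sketch of allocating per-processor storage follows. It is an
illustration under stated assumptions: alloc_per_proc and proc_slice are
hypothetical helper names, and the processor count is passed in as a
parameter so that the sketch does not depend on spp_prog_model.h. On
HP-UX the count would typically be the value returned by num_procs().

```c
#include <stdlib.h>

/* Allocate one zero-initialized work array of len doubles per processor.
 * nprocs is supplied by the caller (for example, from num_procs()).
 * Returns NULL on bad arguments or allocation failure. */
double *alloc_per_proc(int nprocs, size_t len)
{
    if (nprocs <= 0 || len == 0)
        return NULL;
    return calloc((size_t)nprocs * len, sizeof(double));
}

/* Start of the slice that belongs to processor p (0-based). */
double *proc_slice(double *base, int p, size_t len)
{
    return base + (size_t)p * len;
}
```

Each thread can then work on its own slice without touching the storage
of any other processor; the caller releases the whole block with free().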

Number of threads 

Certain functions return the total number of threads the process creates
at initiation, regardless of how many are idle or active. The forms of
these functions are shown in Table 41.

Table 41    Number of threads functions

    Language    Form
    Fortran     INTEGER NUM_THREADS()
                INTEGER*8 NUM_THREADS_8()
    C/C++       int num_threads(void);
                long long num_threads_8(void);


The return value differs from num_procs only if threads are 
oversubscribed. 
Thread ID

When called from parallel code, these functions return the spawn thread
ID of the calling thread, in the range 0..N-1, where N is the number of
threads in the current spawn context (the number of threads spawned by
the last parallel construct). Use them when you wish to direct specific
tasks to specific threads inside parallel constructs. The forms of these
functions are shown in Table 42.


Table 42    Thread ID functions

    Language    Form
    Fortran     INTEGER MY_THREAD()
                INTEGER*8 MY_THREAD_8()
    C/C++       int my_thread(void);
                long long my_thread_8(void);


When called from serial code, these functions return 0. 
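
One common way to direct work to specific threads is to give each thread
ID a contiguous block of iterations. The following C sketch shows the
index arithmetic only. The function name thread_block is hypothetical,
and the thread ID and thread count are passed as parameters; on HP-UX
they would typically come from my_thread() and num_threads().

```c
/* Compute the half-open range [*lo, *hi) of n items owned by thread tid
 * out of nthreads threads. Earlier threads absorb the remainder, so block
 * sizes differ by at most one. */
void thread_block(int tid, int nthreads, int n, int *lo, int *hi)
{
    int base = n / nthreads;
    int rem = n % nthreads;
    *lo = tid * base + (tid < rem ? tid : rem);
    *hi = *lo + base + (tid < rem ? 1 : 0);
}
```

Because a serial caller sees thread ID 0 and, with one thread, receives
the full range, the same code path works in both serial and parallel
contexts.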

Stack memory type 

These functions return a value representing the memory class that the 
current thread stack is allocated from. The thread stack holds all the 
procedure-local arrays and variables not manually assigned a class. On a 
single-node system, the thread stack is created in node_private 
memory by default. The forms of these functions are shown in Table 43.

Table 43    Stack memory type functions

    Language    Form
    Fortran     INTEGER MEMORY_TYPE_OF_STACK()
                INTEGER*8 MEMORY_TYPE_OF_STACK_8()
    C/C++       int memory_type_of_stack(void);
                long long memory_type_of_stack_8(void);
OpenMP Parallel Programming Model


This chapter discusses HP's implementation of the OpenMP v1.1 parallel
programming model, including OpenMP directives and command-line
options in the f90 front end and bridge. Topics covered include:

• What is OpenMP?

• HP's implementation of OpenMP

• From HP Programming Model (HPPM) to OpenMP
What is OpenMP?

OpenMP is a portable, scalable model that gives shared-memory parallel
programmers a simple and flexible interface for developing parallel
applications on platforms ranging from the desktop to the
supercomputer. The OpenMP Application Program Interface (API)
supports multi-platform shared-memory parallel programming in
Fortran on all architectures, including UNIX and Windows NT.
HP's implementation of OpenMP

This section discusses HP's implementation of OpenMP.

Command-line option 

HP OpenMP directives are accepted only if the command-line option
+Oopenmp is given.

NOTE +Oopenmp implies +Onodynsel and +Oparallel.


Default 

The default command-line option is +Onoopenmp. If +Oopenmp is not
given, all OpenMP directives (c$omp) are ignored.

Opt levels and parallelism 

+Oopenmp is accepted at all opt levels. However, for parallel and
work-shared directives (including the clauses for these directives), code
is only parallelized at opt levels +O3 or +O4. The parallel and
work-shared directives are listed in Table 44, "Parallel and work-shared
directives."

Using opt levels +O0 through +O2

When using opt levels +O0 through +O2:

• All sync and run-time library directives are processed and honored. 

• Parallel and work-shared directives (including the clauses for these
directives) are only processed, not honored. While they return correct
answers, you do not achieve parallel code. Each thread runs a serial
version of the code.

Using opt levels +O3 through +O4

When using opt levels +O3 and +O4:

• All sync and run-time library directives are processed and honored. 

• Parallel and work-shared directives (including the clauses for these
directives) are processed and honored. The compiler generates the
parallel and work-shared code required to go parallel.

Table 44    Parallel and work-shared directives

    Parallel/work-shared    Opt level        Opt level required to
    directive               accepted         achieve parallelism
    PARALLEL                +O0, +O1, +O2    +O3, +O4
    PARALLEL DO             +O0, +O1, +O2    +O3, +O4
    PARALLEL SECTIONS       +O0, +O1, +O2    +O3, +O4
    DO                      +O0, +O1, +O2    +O3, +O4
    SECTION                 +O0, +O1, +O2    +O3, +O4
    SECTIONS                +O0, +O1, +O2    +O3, +O4


Parallel regions 

HP does not honor nesting of parallelism. If two parallel directives are 
encountered in the same loop nest, one will be ignored by the compiler. A
warning is issued by the compiler as to which parallel directive is 
ignored. 
From HP Programming Model to OpenMP

This section discusses migration from the HP Programming Model
(HPPM) to the OpenMP parallel programming model.

Syntax

The OpenMP parallel programming model is very similar to the HP
Programming Model (HPPM). The general thread model is the same, and
the spawn (fork) mechanisms behave in a similar fashion. However, the
specific syntax used to specify the underlying semantics has changed
significantly.

The following table shows the OpenMP directive or clause (relative to the
directive) and the equivalent HPPM directive or clause that implements
the same functionality. Certain clauses are valid on multiple directives,
but are typically listed only once unless there is a distinction warranting
further explanation.


Exceptions are defined immediately following the table. 

Table 45    OpenMP and HPPM Directives/Clauses

    HPPM                               OpenMP
    !$dir parallel                     !$OMP parallel
      task_private(list)                 private(list)
      <'shared' is default>              shared(list)
      <None, see below>                  default(private|shared|none)
    !$dir loop_parallel(dist)          !$OMP do
      blocked(chunkconstant)             schedule(static[,chunkconstant])
      ordered                            ordered
    !$dir begin_tasks(dist)            !$OMP sections
    !$dir next_task                    !$OMP section
    !$dir loop_parallel                !$OMP parallel do
      <see parallel and                  <see parallel and do clauses>
      loop_parallel(dist) clauses>
    !$dir begin_tasks                  !$OMP parallel sections
      <see parallel and                  <see parallel and sections
      begin_tasks(dist) clauses>         clauses>
    !$dir critical_section[(name)]     !$OMP critical[(name)]
    !$dir wait_barrier                 !$OMP barrier
    !$dir ordered_section              !$OMP ordered
    <none>                             !$OMP end parallel
    !$dir end_tasks                    !$OMP end sections
    !$dir end_tasks                    !$OMP end parallel sections
    <none>                             !$OMP end parallel do
    !$dir end_critical_section         !$OMP end critical
    !$dir end_ordered_section          !$OMP end ordered
    <none>                             !$OMP end do


Exceptions 

• private(list) / loop_private(list)

OpenMP allows the induction variable to be a member of the variable
list. HPPM does not.

• default(private|shared|none)

HPPM defaults to "shared" and allows the user to specify which
variables should be private. The HP model does not provide "none";
therefore, undeclared variables are treated as shared.

HP Programming Model directives 

This section describes how the HP Programming Model (HPPM)
directives are affected by the implementation of OpenMP.

Not Accepted with +Oopenmp 

These HPPM directives will not be accepted when +Oopenmp is given. 

• parallel 

• end_parallel 

• loop_parallel 

• prefer_parallel 

• begin_tasks 


• next_task 

• end_tasks 

• critical_section 

• end_critical_section 

• ordered_section 

• end_ordered_section 

• loop_private 

• parallel_private 

• task_private 

• save_last 

• reduction 

• dynsel 

• barrier 

• gate 

• sync_routine 

• thread_private 

• node_private 

• thread_private_pointer 

• node_private_pointer 

• near_shared 

• far_shared 

• block_shared 

• near_shared_pointer 

• far_shared_pointer 

NOTE If +Oopenmp is given, the directives above are ignored.

Accepted with +Oopenmp 

These HPPM directives continue to be accepted when +Oopenmp is
given.

• options

• no_dynsel 

• no_unroll_and_jam 

• no_parallel 

• no_block_loop 

• no_loop_transform 

• no_distribute 

• no_loop_dependence 

• scalar 

• unroll_and_jam 

• block_loop 
More information on OpenMP

For more information on OpenMP, see www.openmp.org.

Data privatization


Once HP shared memory classes are assigned, they are implemented
throughout your entire program. Very efficient programs can be written
using these memory classes, as described in "Memory classes," on
page 233. However, these programs also require some manual
intervention. Any loops that manipulate variables that are explicitly
assigned to a memory class must be manually parallelized. Once a
variable is assigned a class, its class cannot change.

This chapter describes the workarounds provided by the HP Fortran and 
C compilers to support: 

• Privatizing loop variables 

• Privatizing task variables 

• Privatizing region variables 
Directives and pragmas for data privatization

This section describes the various directives and pragmas that are 
implemented to achieve data privatization. These directives and 
pragmas are discussed in Table 46. 


Table 46    Data Privatization Directives and Pragmas

    Directive/Pragma     Description                          Level of
                                                              parallelism
    loop_private         Declares a list of variables         Loop
    (namelist)           and/or arrays private to the
                         following loop.
    parallel_private     Declares a list of variables         Region
    (namelist)           and/or arrays private to the
                         following parallel region.
    save_last[(list)]    Specifies that the variables in      Loop
                         the comma-delimited list (also
                         named in an associated
                         loop_private(namelist) directive
                         or pragma) must have their values
                         saved into the shared variable of
                         the same name at loop termination.
    task_private         Privatizes the variables and         Task
    (namelist)           arrays specified in namelist for
                         each task specified in the
                         following begin_tasks/end_tasks
                         block.


These directives and pragmas allow you to easily and temporarily
privatize parallel loop, task, or region data. When used with
prefer_parallel, these directives and pragmas do not inhibit
automatic compiler optimizations. This can increase the performance
of your shared-memory program with less work than is required when
using the standard memory classes for manual parallelization and
synchronization.

The data privatization directives and pragmas are used on local
variables and arrays of any type, but they should not be used on data
assigned to thread_private.

In some cases, data declared loop_private, task_private, or
parallel_private is stored on the stacks of the spawned threads.
Spawned thread stacks default to 80 Mbytes in size.

Privatizing loop variables 

This section describes the following directives and pragmas associated 
with privatizing loop variables: 

• loop_private 

• save_last 


loop_private 

The loop_private directive and pragma declares a list of variables
and/or arrays private to the immediately following Fortran do or C for
loop. loop_private array dimensions must be identifiable at compile
time.

The compiler assumes that data objects declared to be loop_private
have no loop-carried dependences with respect to the parallel loops in
which they are used. If dependences exist, they must be handled
manually using the synchronization directives and techniques described
in "Parallel synchronization," on page 243.

Each parallel thread of execution receives a private copy of the
loop_private data object for the duration of the loop. No starting
values are assumed for the data. Unless a save_last directive or
pragma is specified, no ending value is assumed. If a loop_private
data object is referenced within an iteration of the loop, it must be
assigned a value previously on that same iteration.

The form of this directive and pragma is shown in Table 47.


Table 47    Form of loop_private directive and pragma

    Language    Form
    Fortran     C$DIR LOOP_PRIVATE(namelist)
    C           #pragma _CNX loop_private(namelist)

where

namelist      is a comma-separated list of variables and/or arrays
              that are to be private to the immediately following loop.
              namelist cannot contain structures, dynamic arrays,
              allocatable arrays, or automatic arrays.

loop_private 

The following is a Fortran example of loop_private:

C$DIR LOOP_PRIVATE(S)
      DO I = 1, N
C       S IS ONLY CORRECTLY PRIVATE IF AT LEAST
C       ONE IF TEST PASSES ON EACH ITERATION:
        IF(A(I) .GT. 0) S = A(I)
        IF(U(I) .LT. V(I)) S = V(I)
        IF(X(I) .LE. Y(I)) S = Z(I)
        B(I) = S * C(I) + D(I)
      ENDDO

A potential loop-carried dependence on s exists in this example. If none 
of the if tests are true on a given iteration, the value of s must wrap 
around from the previous iteration. The loop_private (S) directive 
indicates to the compiler that s does, in fact, get assigned on every
iteration, and therefore it is safe to parallelize this loop.

If on any iteration none of the if tests pass, an actual LCD exists and 
privatizing s results in wrong answers. 
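
One way to remove the risk is to structure the loop so that the private
variable is assigned unconditionally. The C sketch below is an
illustration, not a translation of the Fortran example: the function name
scale_rows and the fallback value 0.0 are assumptions, and it uses an
if/else chain rather than independent if tests. The pragma follows the
form given for loop_private; compilers that do not recognize _CNX
pragmas ignore it, and the loop then simply runs serially.

```c
/* Compute b[i] = s * c[i] + d[i], where s is chosen per iteration.
 * The final else guarantees that s is assigned on every iteration, so
 * no value of s is carried over from an earlier iteration. */
void scale_rows(int n, const double *a, const double *u, const double *v,
                const double *c, const double *d, double *b)
{
    int i;
    double s;

#pragma _CNX loop_private(s)
    for (i = 0; i < n; i++) {
        if (a[i] > 0.0)
            s = a[i];
        else if (u[i] < v[i])
            s = v[i];
        else
            s = 0.0;             /* fallback chosen for this sketch */
        b[i] = s * c[i] + d[i];
    }
}
```

With the fallback branch in place, the assertion made by loop_private
holds for any input data rather than depending on the values in a, u,
and v.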

Using loop_private with loop_parallel 

Because the compiler does not automatically perform variable
privatization in loop_parallel loops, you must manually privatize
loop data requiring privatization. This is easily done using the
loop_private directive or pragma.

The following Fortran example shows how loop_private manually
privatizes loop data:

      SUBROUTINE PRIV(X,Y,Z)
      REAL X(1000), Y(4,1000), Z(1000)
      REAL XMFIED(1000)
C$DIR LOOP_PARALLEL, LOOP_PRIVATE(XMFIED, J)
      DO I = 1, 4
C       INITIALIZE XMFIED; MFY MUST NOT WRITE TO X:
        CALL MFY(X, XMFIED)
        DO J = 1, 999
          IF (XMFIED(J) .GE. Y(I,J)) THEN
            Y(I,J) = XMFIED(J) * Z(J)
          ELSE
            XMFIED(J+1) = XMFIED(J)
          ENDIF
        ENDDO
      ENDDO
      END

Here, the loop_parallel directive is required to parallelize the i loop
because of the call to MFY. The x, y, and z arrays are in shared memory
by default; x and z are not written to, and the portions of y written to
in the j loop's if statement are disjoint, so these shared arrays require
no special attention. The local array xmfied, however, is written to. But
because xmfied carries no values into or out of the i loop, it is
privatized using loop_private. This gives each thread running the i loop
its own private copy of xmfied, eliminating the expensive necessity of
synchronized access to xmfied.

Note that an LCD exists for xmfied in the j loop, but because this loop
runs serially on each processor, the dependence is safe.

Denoting induction variables in parallel loops 

To safely parallelize a loop with the loop_parallel directive or
pragma, the compiler must be able to correctly determine the loop's
primary induction variable.

The compiler can find primary Fortran do loop induction variables. It
may, however, have trouble with do while or customized Fortran loops,
and with all loop_parallel loops in C. Therefore, when you use the
loop_parallel directive or pragma to manually parallelize a loop other
than an explicit Fortran do loop, you should indicate the loop's primary
induction variable using the ivar=indvar attribute to loop_parallel.

Denoting induction variables in parallel loops 

Consider the following Fortran example: 

      I = 1
C$DIR LOOP_PARALLEL(IVAR = I)
 10   A(I) = ...
      ! ASSUME NO DEPENDENCES
      I = I + 1
      IF(I .LE. N) GOTO 10

The above is a customized loop that uses i as its primary induction 
variable. To ensure parallelization, the loop_parallel directive is 
placed immediately before the start of the loop, and the induction 
variable, i, is specified. 

Denoting induction variables in parallel loops 
Primary induction variables in C loops are difficult for the compiler to
find, so ivar is required in all loop_parallel C loops. Its use is shown
in the following example:

#pragma _CNX loop_parallel(ivar=i)
for(i=0; i<n; i++) {
    a[i] = ...;
    /* assume no dependences */
}

Secondary induction variables 

Secondary induction variables are variables used to track loop
iterations, even though they do not appear in the Fortran do statement.
They cannot appear in addition to the primary induction variable in the
C for statement.

Such variables must be a function of the primary loop induction variable,
and they cannot be independent. Secondary induction variables must be
assigned loop_private.

Secondary induction variables 

The following Fortran example contains an incorrectly incremented
secondary induction variable:

C WARNING: INCORRECT EXAMPLE!!!!
      J = 1
C$DIR LOOP_PARALLEL
      DO I = 1, N
        J = J + 2      ! WRONG!!!
      ENDDO

In this example, j does not produce expected values in each iteration
because multiple threads are overwriting its value with no
synchronization. The compiler cannot privatize j because it carries a
loop-carried dependence (LCD). This example is corrected by privatizing
j and making it a function of i, as shown below.

C CORRECT EXAMPLE:
      J = 1
C$DIR LOOP_PARALLEL
C$DIR LOOP_PRIVATE(J)      ! J IS PRIVATE
      DO I = 1, N
        J = (2*I)+1        ! J IS PRIVATE
      ENDDO

As shown in the preceding example, j is assigned correct values on each 
iteration because it is a function of i and is safely privatized. 

Secondary induction variables 
In C, secondary induction variables are sometimes included in for
statements, as shown in the following example:

/* warning: unparallelizable code follows */
#pragma _CNX loop_parallel(ivar=i)
for(i=j=0; i<n; i++, j+=2) {
    a[i] = ...;
}

Because secondary induction variables must be private to the loop and
must be a function of the primary induction variable, this example
cannot be safely parallelized using loop_parallel(ivar=i). In the
presence of this directive, the secondary induction variable is not
recognized.

To manually parallelize this loop, you must remove j from the for
statement, privatize it, and make it a function of i.

The following example demonstrates how to restructure the loop so that
j is a valid secondary induction variable:

#pragma _CNX loop_parallel(ivar=i)
#pragma _CNX loop_private(j)
for(i=0; i<n; i++) {
    j = 2*i;
    a[i] = ...;
}

This method runs faster than placing j in a critical section because it 
requires no synchronization overhead, and the private copy of j used 
here can typically be more quickly accessed than a shared variable. 
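
The equivalence between the incremental and closed-form versions of a
secondary induction variable can be checked with a small C sketch. The
function names fill_incremental and fill_closed_form are hypothetical,
and storing j into an output array is an assumption made only so that the
two loops can be compared. Compilers that do not recognize _CNX pragmas
ignore them and run the second loop serially.

```c
/* Incremental form: j depends on its value from the previous iteration,
 * which is a loop-carried dependence. */
void fill_incremental(int n, int *out)
{
    int i, j;
    for (i = j = 0; i < n; i++, j += 2)
        out[i] = j;
}

/* Closed form: j is recomputed from i alone, so iterations are
 * independent and j can be made private to the loop. */
void fill_closed_form(int n, int *out)
{
    int i, j;
#pragma _CNX loop_parallel(ivar=i)
#pragma _CNX loop_private(j)
    for (i = 0; i < n; i++) {
        j = 2 * i;
        out[i] = j;
    }
}
```

Both functions produce the same array, but only the second can have its
iterations distributed across threads in any order.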

save_last[(list)]

A save_last directive or pragma causes the thread that executes the
last iteration of the loop to write back the private (or local) copy of the
variable into the global reference.

The save_last directive and pragma allows you to save the final value
of loop_private data objects assigned in the last iteration of the
immediately following loop.

• If list (the optional, comma-separated list of loop_private data
objects) is specified, only the final values of those data objects in list
are saved.

• If list is not specified, the final values of all loop_private data
objects assigned in the last loop iteration are saved.

The values for this directive and pragma must be assigned in the last
iteration. If the assignment is executed conditionally, it is your
responsibility to ensure that the condition is met and the assignment
executes. Inaccurate results may occur if the assignment does not
execute on the last iteration. For loop_private arrays, only those
elements of the array assigned on the last iteration are saved.

The form of this directive and pragma is shown in Table 48.


Table 48    Form of save_last directive and pragma

    Language    Form
    Fortran     C$DIR SAVE_LAST[(list)]
    C           #pragma _CNX save_last[(list)]


save_last must appear immediately before or after the associated
loop_private directive or pragma, or on the same line.

save_last 

The following is a C example of save_last:

#pragma _CNX loop_parallel(ivar=i)
#pragma _CNX loop_private(atemp, x, y)
#pragma _CNX save_last(atemp, x)
for(i=0; i<n; i++) {
    if(i==d[i]) atemp = a[i];
    if(i==e[i]) atemp = b[i];
    if(i==f[i]) atemp = c[i];
    a[i] = b[i] + c[i];
    b[i] = atemp;
    x = atemp * a[i];
    y = atemp * c[i];
}
...
if(atemp > amax) {
    ...
}
In this example, the loop_private variable atemp is conditionally
assigned in the loop. In order for atemp to be truly private, you must be
sure that at least one of the conditions is met so that atemp is assigned
on every iteration.

When the loop terminates, the save_last pragma ensures that atemp
and x contain the values they are assigned on the last iteration. These
values can then be used later in the program. The value of y, however, is
not available once the loop finishes because y is not specified as an
argument to save_last.

save_last 

There are some loop contexts in which the save_last directive and
pragma is misleading.

The following Fortran code provides an example of this:

C$DIR LOOP_PARALLEL
C$DIR LOOP_PRIVATE(S)
C$DIR SAVE_LAST
      DO I = 1, N
        IF(G(I) .GT. 0) THEN
          S = G(I) * G(I)
        ENDIF
      ENDDO

While it may appear that the last value of s assigned is saved in this
example, you must remember that the save_last directive applies only
to the last (Nth) iteration, with no regard for any conditionals contained
in the loop. For save_last to be valid here, g(n) must be greater than 0
so that the assignment to s takes place on the final iteration.

Obviously, if this condition can be predicted, the loop is more
efficiently written to exclude the if test, so the presence of a save_last
in such a loop is suspect.
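
The rule can be made concrete with a serial model written in C. This is
a sketch of the semantics described in this section, not of how the
compiler implements save_last: the function name model_save_last is
hypothetical, and where the text says results may be inaccurate, the
model simply leaves the shared variable untouched and reports that no
write-back took place.

```c
/* Serial model of
 *     loop_private(s), save_last
 *     for i in 0..n-1: if (g[i] > 0) s = g[i] * g[i];
 * Only an assignment made on the final iteration is copied back to the
 * shared variable. Returns 1 if *shared_s was updated, 0 otherwise. */
int model_save_last(int n, const double *g, double *shared_s)
{
    int i;
    for (i = 0; i < n; i++) {
        double s = 0.0;          /* private copy; starting value unused */
        int assigned = 0;
        if (g[i] > 0.0) {
            s = g[i] * g[i];
            assigned = 1;
        }
        if (i == n - 1 && assigned) {
            *shared_s = s;       /* write-back by the last iteration */
            return 1;
        }
    }
    return 0;
}
```

With g = {2, 3, 4} the model writes back 16. With g = {2, 3, -4} nothing
is written back, even though s was assigned on earlier iterations.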


Privatizing task variables 

Task privatization is manually specified using the task_private
directive and pragma. task_private declares a list of variables and/or
arrays private to the immediately following tasks. It serves the same
purpose for parallel tasks that loop_private serves for loops and
parallel_private serves for regions.


task_private

The task_private directive must immediately precede, or appear on
the same line as, its corresponding begin_tasks directive. The compiler
assumes that data objects declared to be task_private have no
dependences between the tasks in which they are used. If dependences
exist, you must handle them manually using the synchronization
directives and techniques described in "Parallel synchronization," on
page 243.

Each parallel thread of execution receives a private copy of the
task_private data object for the duration of the tasks. No starting or
ending values are assumed for the data. If a task_private data object
is referenced within a task, it must have been previously assigned a
value in that task.

The form of this directive and pragma is shown in Table 49.


Table 49    Form of task_private directive and pragma

    Language    Form
    Fortran     C$DIR TASK_PRIVATE(namelist)
    C           #pragma _CNX task_private(namelist)

where

namelist      is a comma-separated list of variables and/or arrays
              that are to be private to the immediately following
              tasks. namelist cannot contain dynamic, allocatable, or
              automatic arrays.

task_private 

The following Fortran code provides an example of task privatization:

      REAL*8 A(1000), B(1000), WRK(1000)

C$DIR BEGIN_TASKS, TASK_PRIVATE(WRK)
      DO I = 1, N
        WRK(I) = A(I)
      ENDDO
      DO I = 1, N
        A(I) = WRK(N+1-I)
      ENDDO
C$DIR NEXT_TASK
      DO J = 1, M
        WRK(J) = B(J)
      ENDDO
      DO J = 1, M
        B(J) = WRK(M+1-J)
      ENDDO
C$DIR END_TASKS

In this example, the wrk array is used in the first task to temporarily
hold the a array so that its order is reversed. It serves the same purpose
for the b array in the second task. wrk is assigned before it is used in
each task.
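
A C counterpart of this pattern might look like the following sketch. It
is an assumption-laden illustration: the function name reverse_both and
the MAXLEN bound are hypothetical, the begin_tasks, next_task, and
end_tasks pragma spellings are inferred from the Fortran directive names,
and the loop index i is listed in task_private here as a precaution
because both tasks use it. Compilers that do not recognize _CNX pragmas
ignore them and run the two tasks one after the other.

```c
#define MAXLEN 1000

/* Reverse a[0..n-1] and b[0..m-1] in place, each through a scratch
 * buffer. Requires n <= MAXLEN and m <= MAXLEN. */
void reverse_both(double *a, int n, double *b, int m)
{
    double wrk[MAXLEN];          /* scratch; private to each task */
    int i;

#pragma _CNX task_private(wrk, i)
#pragma _CNX begin_tasks
    for (i = 0; i < n; i++)
        wrk[i] = a[i];
    for (i = 0; i < n; i++)
        a[i] = wrk[n - 1 - i];

#pragma _CNX next_task
    for (i = 0; i < m; i++)
        wrk[i] = b[i];
    for (i = 0; i < m; i++)
        b[i] = wrk[m - 1 - i];

#pragma _CNX end_tasks
}
```

As in the Fortran version, each task writes wrk before reading it, so no
value needs to flow into or out of the private copy.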


Privatizing region variables 

Regional privatization is manually specified using the
parallel_private directive or pragma. parallel_private is
provided to declare a list of variables and/or arrays private to the
immediately following parallel region. It serves the same purpose for
parallel regions as task_private does for tasks, and loop_private
does for loops.

parallel_private

The parallel_private directive must immediately precede, or appear
on the same line as, its corresponding parallel directive. Using
parallel_private asserts that there are no dependences in the
parallel region.

Do not use parallel_private if there are dependences.

Each parallel thread of execution receives a private copy of the
parallel_private data object for the duration of the region. No
starting or ending values are assumed for the data. If a
parallel_private data object is referenced within a region, it must
have been previously assigned a value in the region.

The form of this directive and pragma is shown in Table 50.


Table 50    Form of parallel_private directive and pragma

    Language    Form
    Fortran     C$DIR PARALLEL_PRIVATE(namelist)
    C           #pragma _CNX parallel_private(namelist)

where

namelist      is a comma-separated list of variables and/or arrays
              that are to be private to the immediately following
              parallel region. namelist cannot contain dynamic,
              allocatable, or automatic arrays.

parallel_private 
The following Fortran code shows how parallel_private privatizes
regions:

      REAL A(1000,8), B(1000,8), C(1000,8), AWORK(1000), SUM(8)
      INTEGER MYTID

C$DIR PARALLEL(MAX_THREADS = 8)
C$DIR PARALLEL_PRIVATE(I,J,K,L,M,AWORK,MYTID)
      IF(NUM_THREADS() .LT. 8) STOP "NOT ENOUGH THREADS; EXITING"
      MYTID = MY_THREAD() + 1   !ADD 1 FOR PROPER SUBSCRIPTING
      DO I = 1, 1000
        AWORK(I) = A(I, MYTID)
      ENDDO
      DO J = 1, 1000
        A(J, MYTID) = AWORK(J) + B(J, MYTID)
      ENDDO
      DO K = 1, 1000
        B(K, MYTID) = B(K, MYTID) * AWORK(K)
        C(K, MYTID) = A(K, MYTID) * B(K, MYTID)
      ENDDO
      DO L = 1, 1000
        SUM(MYTID) = SUM(MYTID) + A(L,MYTID) + B(L,MYTID) +
     &               C(L,MYTID)
      ENDDO
      DO M = 1, 1000
        A(M, MYTID) = AWORK(M)
      ENDDO
C$DIR END_PARALLEL

This example checks for a certain number of threads and divides up the
work among those threads. The example additionally introduces the
parallel_private variable awork.

Each thread initializes its private copy of awork to the values contained
in a dimension of the array a at the beginning of the parallel region.
This allows the threads to reference awork without regard to thread ID.
This is because no thread can access any other thread's copy of awork.
Because awork cannot carry values into or out of the region, it must be
initialized within the region.

Induction variables in region privatization 

All induction variables contained in a parallel region must be privatized.
Code contained in the region runs on all available threads. Failing to
privatize an induction variable would allow each thread to update the
same shared variable, creating indeterminate loop counts on every
thread.

In the previous example, in the j loop, after awork is initialized, awork
is effectively used in a reduction on a; at this point its contents are
identical to the mytid dimension of a. After a is modified and used in the
k and l loops, each thread restores a dimension of a's original values
from its private copy of awork. This carries the appropriate dimension
through the region unaltered.
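
The save, modify, and restore pattern can be modeled in C by writing the
body of the region as a function of the thread ID. This is a sketch under
stated assumptions: region_body, ROWS, and NTHREADS are hypothetical
names, the arrays are stored column by column to mirror the Fortran
layout, and sum[tid] is explicitly zeroed here although the Fortran
listing does not show an initialization. Because awork is an automatic
local, every call has its own copy, which is the effect parallel_private
provides for each thread.

```c
#define ROWS 1000
#define NTHREADS 8

/* Work done by one thread of the region on column tid of a, b and c.
 * Element (r, t) of each array is stored at index t * ROWS + r. */
void region_body(int tid, double *a, double *b, double *c, double *sum)
{
    double awork[ROWS];          /* private scratch copy of a's column */
    double *at = a + tid * ROWS;
    double *bt = b + tid * ROWS;
    double *ct = c + tid * ROWS;
    int r;

    for (r = 0; r < ROWS; r++)   /* save the original column of a */
        awork[r] = at[r];
    for (r = 0; r < ROWS; r++)   /* modify a */
        at[r] = awork[r] + bt[r];
    for (r = 0; r < ROWS; r++) {
        bt[r] = bt[r] * awork[r];
        ct[r] = at[r] * bt[r];
    }
    sum[tid] = 0.0;              /* assumption: start each sum at zero */
    for (r = 0; r < ROWS; r++)
        sum[tid] += at[r] + bt[r] + ct[r];
    for (r = 0; r < ROWS; r++)   /* restore the original column of a */
        at[r] = awork[r];
}
```

Calling region_body once for each thread ID from 0 to NTHREADS-1
reproduces serially what the eight threads of the region do
concurrently, since no call reads or writes another thread's column.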
Memory classes


The V-Class server implements only one partition of hypernode-local
memory. This is accessed using the thread_private and
node_private virtual memory classes. This chapter includes discussion
of the following topics:

• Private versus shared memory

• Memory class assignments

The information in this chapter is provided for programmers who want to
manually optimize their shared-memory programs on a single-node
server. This is ultimately achieved by using compiler directives or
pragmas to partition memory and otherwise control compiler
optimizations. It can also be achieved using storage class specifiers in C
and C++.
Porting multinode applications to single-node servers

Programs developed to run on multinode servers, such as the legacy 
X-Class server, can be run on K-Class or V-Class servers. The program 
runs as it would on one node of a multinode machine. 

When a multinode application is executed on a single-node server: 

• All PARALLEL, LOOP_PARALLEL, PREFER_PARALLEL, and 
BEGIN_TASKS directives containing node attributes are ignored. 

• All variables, arrays and pointers that are declared to be 
near_shared, far_shared, or block_shared are assigned to the 
node_private class. 

• The thread_private and node_private classes remain 
unchanged and function as usual. 

See the Exemplar Programming Guide for HP-UX Systems for a complete 
description of how to program multinode applications using HP parallel 
directives. 

Private versus shared memory 

Private and shared data are differentiated by their accessibility and by 
the physical memory classes in which they are stored. 

thread_private data is stored in node-local memory. Access to 
thread_private is restricted to the declaring thread. 

When porting multinode applications to the HP single-node machine, all 
legacy shared memory classes (such as near_shared, far_shared, 
and block_shared) are automatically mapped to the node_private 
memory class. This is the default memory class on the K-Class and 
V-Class servers. 

thread_private 

thread_private data is private to each thread of a process. Each 
thread_private data object has its own unique virtual address within 
a hypernode. This virtual address maps to unique physical addresses in 
hypernode-local physical memory. 

Any sharing of thread_private data items between threads 
(regardless of whether they are running on the same node) must be done 
by synchronized copying of the item into a shared variable, or by 
message passing. 

thread_private data cannot be initialized in C, C++, or in Fortran data 
statements. 

node_private 

node_private data is shared among the threads of a process running 
on a given node. It is the default memory class on the V-Class single-node 
server, and does not need to be explicitly specified. node_private data 
items have one virtual address, and any thread on a node can access that 
node's node_private data using the same virtual address. This virtual 
address maps to a unique physical address in node-local memory. 
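
The difference between per-thread and per-node data has a rough analog in 
standard C, which the following sketch uses for illustration only. It 
assumes C11 _Thread_local storage and POSIX threads in place of the HP 
memory classes; every name in it (private_count, shared_count, 
copied_out, run_counters) is hypothetical. The private counter needs no 
locking, the shared counter does, and each private value reaches the 
other threads only by being copied into shared data. 

```c
#include <pthread.h>
#include <stddef.h>

enum { NTHR = 4, STEPS = 1000 };           /* hypothetical sizes */

static _Thread_local int private_count;    /* one copy per thread */
static int shared_count;                   /* one copy for all threads */
static int copied_out[NTHR];               /* shared: receives each private value */
static pthread_mutex_t guard = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int id = *(int *)arg;

    private_count = 0;                     /* initialized at run time, per thread */
    for (int k = 0; k < STEPS; k++) {
        private_count++;                   /* private: no synchronization needed */
        pthread_mutex_lock(&guard);
        shared_count++;                    /* shared: updates must be synchronized */
        pthread_mutex_unlock(&guard);
    }
    copied_out[id] = private_count;        /* publish by copying to shared data */
    return NULL;
}

/* Runs NTHR workers and returns the final value of the shared counter. */
int run_counters(void)
{
    pthread_t tid[NTHR];
    int ids[NTHR];

    shared_count = 0;
    for (int i = 0; i < NTHR; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHR; i++)
        pthread_join(tid[i], NULL);        /* after join, copied_out[] is complete */
    return shared_count;
}
```

Each worker ends with its own count of STEPS, while the shared counter 
accumulates the increments of all workers. 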
Memory class assignments 

In Fortran, compiler directives are used to assign memory classes to data 
items. In C and C++, memory classes are assigned through the use of 
syntax extensions, which are defined in the header file 
/usr/include/spp_prog_model.h. This file must be included in any 
C or C++ program that uses memory classes. In C++, you can also use 
operator new to assign memory classes. 

• The Fortran memory class declarations must appear with other 
specification statements; they cannot appear within executable 
statements. 

• In C and C++, parallel storage class extensions are used, so memory 
classes are assigned in variable declarations. 

On a single-node system, HP compilers provide mechanisms for 
statically assigning memory classes. This chapter discusses these 
memory class assignments. 

The form of the directives and declarations associated with memory 
classes is shown in Table 51. 


Table 51 Form of memory class directives and variable declarations 

Language   Form 
Fortran    C$DIR memory_class_name(namelist) 
C/C++      #include <spp_prog_model.h> 
           [storage_class_specifier] memory_class_name type_specifier namelist 

where (for Fortran) 

memory_class_name 

can be thread_private or node_private 
namelist 

is a comma-separated list of variables, arrays, and/or 
common block names to be assigned the class 
memory_class_name. Common block names must be 
enclosed in slashes (/), and only entire common blocks 
can be assigned a class. This means arrays and 
variables in namelist must not also appear in a common 
block and must not be equivalenced to data objects in 
COMMON blocks. 

where (for C) 

storage_class_specifier 

specifies a nonautomatic storage class 

memory_class_name 

is the desired memory class (thread_private, 
node_private) 

type_specifier 

is a C or C++ data type (int, float, etc.) 

namelist 

is a comma-separated list of variables and/or arrays of 
type type_specifier 

C and C++ data objects 

In C and C++, data objects that are assigned a memory class must have 
static storage duration. This means that if the object is declared within a 
function, it must have the storage class extern or static. If such an 
object is not given one of these storage classes, its storage class defaults 
to automatic and it is allocated on the stack. Stack-based objects cannot 
be assigned a memory class; attempting to do so results in a compile-time 
error. 

Data objects declared at file scope and assigned a memory class need not 
specify a storage class. 

All C and C++ code examples presented in this chapter assume that the 
following line appears above the code presented: 

#include <spp_prog_model.h> 

This header file maps user symbols to the implementation reserved 
space. 


If operator new is used, it is also assumed that the line below appears 
above the code: 

#include <new.h> 

If you assign a memory class to a C or C++ structure, all structure 
members must be of the same class. 

Once a data item is assigned a memory class, the class cannot be 
changed. 

Static assignments 

Static memory class assignments are physically located with variable 
type declarations in the source. Static memory classes are typically used 
with data objects that are accessed with equal frequency by all threads. 
These include objects of the thread_private and node_private 
classes. Static assignments for all classes are explained in the 
subsections that follow. 

thread_private 

Because thread_private variables are replicated for every thread, 
static declarations make the most sense for them. 

thread_private 

In Fortran, the thread_private memory class is assigned using the 
thread_private compiler directive, as shown in the following example: 

      REAL*8 TPX(1000) 
      REAL*8 TPY(1000) 
      REAL*8 TPZ(1000), X, Y 
      COMMON /BLK1/ TPZ, X, Y 
C$DIR THREAD_PRIVATE(TPX, TPY, /BLK1/) 

Each array declared here is 8000 bytes in size, and each scalar variable 
is 8 bytes, for a total of 24,016 bytes of data. The entire common block 
blk1 is placed in thread_private memory along with tpx and tpy. All 
memory space is replicated for each thread in hypernode-local physical 
memory. 
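
The storage figures quoted above follow from simple arithmetic, which the 
short C sketch below reproduces. The helper names and the thread count 
passed in are illustrative assumptions, not part of the HP programming 
model. 

```c
#include <stddef.h>

/* Hypothetical helpers that reproduce the sizes quoted in the text. */
enum { ARRAY_LEN = 1000, NUM_ARRAYS = 3, NUM_SCALARS = 2 };

/* Bytes of thread_private data held by ONE thread:
   three REAL*8 arrays of 1000 elements plus two REAL*8 scalars. */
size_t bytes_per_thread(void)
{
    const size_t real8 = 8;            /* a REAL*8 occupies 8 bytes */
    return NUM_ARRAYS * ARRAY_LEN * real8 + NUM_SCALARS * real8;
}

/* Total physical memory once the data is replicated for every thread. */
size_t bytes_replicated(size_t nthreads)
{
    return nthreads * bytes_per_thread();
}
```

With eight threads, for example, the replicated data would occupy 
8 x 24,016 = 192,128 bytes. 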

thread_private 

The following C/C++ example demonstrates several ways to declare 
thread_private storage. The data objects declared here are not scoped 
analogously to those declared in the Fortran example: 

/* tpa is global: */ 
thread_private double tpa[1000]; 

func() { 
    /* tpb is local to func: */ 
    static thread_private double tpb[1000]; 

    /* tpc, a and b are declared elsewhere: */ 
    extern thread_private double tpc[1000], a, b; 
} 


The C/C++ double data type provides the same precision as Fortran's 
REAL*8. The thread_private data declared here occupies the same 
amount of memory as that declared in the Fortran example. tpa is 
available to all functions lexically following it in the file. tpb is local to 
func and inaccessible to other functions. tpc, a, and b are declared at 
file scope in another file that is linked with this one. 

thread_private common blocks in parallel subroutines 

Data local to a procedure that is called in parallel is effectively private 
because storage for it is allocated on the thread's private stack. However, 
if the data is in a Fortran common block (or if it appears in a data or 
save statement), it is not stored on the stack. Parallel accesses to such 
nonprivate data must be synchronized if it is assigned a shared class. 
Alternatively, if the parallel copies of the procedure do not need to share 
the data, it can be assigned a private class. 
Consider the following Fortran example: 

      INTEGER A(1000,1000) 

C$DIR LOOP_PARALLEL(THREADS) 
      DO I = 1, N 

        CALL PARCOM(A(1,I)) 

      ENDDO 

      SUBROUTINE PARCOM(A) 
      INTEGER A(*) 
      INTEGER C(1000), D(1000) 
      COMMON /BLK1/ C, D 
C$DIR THREAD_PRIVATE(/BLK1/) 
      INTEGER TEMP1, TEMP2 
      D(1:1000) = . . . 

      CALL PARCOM2(A, JTA) 

      END 

      SUBROUTINE PARCOM2(B,JTA) 
      INTEGER B(*), JTA 
      INTEGER C(1000), D(1000) 
      COMMON /BLK1/ C, D 
C$DIR THREAD_PRIVATE(/BLK1/) 
      DO J = 1, 1000 
        C(J) = D(J) * B(J) 
      ENDDO 
      END 


In this example, common block blk1 is declared thread_private, so 
every parallel instance of parcom gets its own copy of the arrays c and d. 

Because this code is already thread-parallel when the common block is 
defined, no further parallelism is possible, and blk1 is therefore suitable 
for use anywhere in parcom. The local variables temp1 and temp2 are 
allocated on the stack, so each thread effectively has private copies of 
them. 

node_private 

Because the space for node_private variables is physically replicated, 
static declarations make the most sense for them. 

In Fortran, the node_private memory class is assigned using the 
node_private compiler directive, as shown in the following example: 

      REAL*8 XNP(1000) 
      REAL*8 YNP(1000) 
      REAL*8 ZNP(1000), X, Y 
      COMMON /BLK1/ ZNP, X, Y 
C$DIR NODE_PRIVATE(XNP, YNP, /BLK1/) 

Again, the data requires 24,016 bytes. The contents of blk1 are placed in 
node_private memory along with xnp and ynp. Space for each data 
item is replicated once per hypernode in hypernode-local physical 
memory. The same virtual address is used by each thread to access its 
hypernode's copy of a data item. 

node_private variables and arrays can be initialized in Fortran data 
statements. 

node_private 

The following example shows several ways to declare node_private 
data objects in C and C++: 

/* npa is global: */ 
node_private double npa[1000]; 

func() { 
    /* npb is local to func: */ 
    static node_private double npb[1000]; 

    /* npc, a and b are declared elsewhere: */ 
    extern node_private double npc[1000], a, b; 
} 


The node_private data declared here occupies the same amount of 
memory as that declared in the Fortran example. Scoping rules for this 
data are similar to those given for the thread_private C/C++ example. 

Parallel synchronization 


Most of the manual parallelization techniques discussed in "Parallel 
programming techniques," on page 175, allow you to take advantage of 
the compilers' automatic dependence checking and data privatization. 
The examples that used the loop_private and task_private 
directives and pragmas in "Data privatization," on page 217, are 
exceptions to this. In these cases, manual privatization is required, but is 
performed on a loop-by-loop basis. Only the simplest data dependences 
are handled. 

This chapter discusses manual parallelization techniques that handle 
multiple and ordered data dependences. This includes a discussion of the 
following topics: 

• Thread-parallelism 

• Synchronization tools 

• Synchronizing code 
Thread-parallelism 

Only one level of parallelism is supported: thread-parallelism. If you 
attempt to spawn thread-parallelism from within a thread-parallel 
construct, your directives on the inner thread-parallel construct are 
ignored. 

Thread ID assignments 

Programs are initiated as a collection of threads, one per available 
processor. All but thread 0 are idle until parallelism is encountered. 

When a process begins, the threads created to run it have unique kernel 
thread IDs. Thread 0, which runs all the serial code in the program, has 
kernel thread ID 0. The rest of the threads have unique but unspecified 
kernel thread IDs at this point. The num_threads() intrinsic returns 
the number of threads created, regardless of how many are active when 
it is called. 

When thread 0 encounters parallelism, it spawns some or all of the 
threads created at program start. This means it causes these threads to 
go from idle to active, at which point they begin working on their share of 
the parallel code. All available threads are spawned by default, but this 
can be changed using various compiler directives. 

If the parallel structure is thread-parallel, then num_threads() threads 
are spawned, subject to user-specified limits. At this point, kernel thread 
0 becomes spawn thread 0, and the spawned threads are assigned spawn 
thread IDs ranging from 0..num_threads()-1. This range begins at 
what used to be kernel thread 0. 

If you manually limit the number of spawned threads, these IDs range 
from 0 to one less than your limit. 
Synchronization tools 

The compiler cannot automatically parallelize loops containing complex 
dependences. However, a rich set of directives, pragmas, and data types 
is available to help you manually parallelize such loops by synchronizing 
and ordering access to the code containing the dependence. 

These directives can also be used to synchronize dependences in parallel 
tasks. They allow you to efficiently exploit parallelism in structures that 
would otherwise be unparallelizable. 

Using gates and barriers 

Gates allow you to restrict execution of a block of code to a single thread. 
They are allocated, locked, unlocked, and deallocated using the functions 
described in "Synchronization functions" on page 246. They can also be 
used with the ordered or critical section directives, which automate the 
locking and unlocking functions. 

Barriers block further execution until all executing threads reach the 
barrier and then thread 0 can proceed past the barrier. 

Gates and barriers use dynamically allocatable variables, declared using 
compiler directives in Fortran and using data declarations in C and C++. 
They may be initialized and referenced only by passing them as 
arguments to the functions discussed in the following sections. 

The forms of these variable declarations are shown in Table 52. 


Table 52 Forms of gate and barrier variable declarations 

Language   Form 
Fortran    C$DIR GATE(namelist) 
           C$DIR BARRIER(namelist) 
C/C++      gate_t namelist; 
           barrier_t namelist; 

where 

namelist is a comma-separated list of one or more gate or barrier 
names, as appropriate. 

In C and C++ 

In C and C++, gates and barriers should appear only in definition and 
declaration statements, and as formal and actual arguments. They 
declare default-size variables. 

In Fortran 

The Fortran gate and barrier variable declarations can only appear: 

• In common statements (statement must precede gate directive/ 
barrier directive) 

• In dimension statements (statement must precede gate directive/ 
barrier directive) 

• In preceding type statements 

• As dummy arguments 

• As actual arguments 

Gate and barrier types override other same-named types declared prior 
to the gate/barrier directives. Once a variable is defined as a gate or 
barrier, it cannot be redeclared as another type. Gates and barriers 
cannot be equivalenced. 

If you place gates or barriers in common, the common block declaration 
must precede the gate directive/barrier directive. The common block 
should contain only gates or only barriers. Arrays of gates or barriers 
must be dimensioned using dimension statements. The dimension 
statement must precede the gate directive/barrier directive. 

Synchronization functions 

The Fortran, C, and C++ allocation, deallocation, lock and unlock 
functions for use with gates and barriers are described in this section. 
Both 4- and 8-byte versions are provided. The 8-byte Fortran functions 
are primarily for use with compiler options that change the default data 
size to 8 bytes (for example, -is). You must be consistent in your choice 
of versions: memory allocated using an 8-byte function must be 
deallocated using an 8-byte function. 

Examples of using these functions are presented and explained 
throughout this section. 

Allocation functions 

Allocation functions allocate memory for a gate or barrier. When first 
allocated, gate variables are unlocked. The forms of these allocation 
functions are shown in Table 53. 

Table 53 Forms of allocation functions 

Language   Form 
Fortran    INTEGER FUNCTION ALLOC_GATE(gate) 
           INTEGER FUNCTION ALLOC_BARRIER(barrier) 
C/C++      int alloc_gate(gate_t *gate_p); 
           int alloc_barrier(barrier_t *barrier_p); 

where (in Fortran) 

gate and barrier are gate or barrier variables. 

where (in C/C++) 

gate_p and barrier_p are pointers of the indicated type. 

Deallocation functions 

The deallocation functions free the memory assigned to the specified gate 
or barrier variable. The forms of these deallocation functions are shown 
in Table 54. 

Table 54 Forms of deallocation functions 

Language   Form 
Fortran    INTEGER FUNCTION FREE_GATE(gate) 
           INTEGER FUNCTION FREE_BARRIER(barrier) 
C/C++      int free_gate(gate_t *gate_p); 
           int free_barrier(barrier_t *barrier_p); 

where (in Fortran) 

gate and barrier are gate or barrier variables previously declared in the 
gate and barrier allocation functions. 

where (in C/C++) 

gate_p and barrier_p are pointers of the indicated type. 

Always free gates and barriers after using them. 

Locking functions 

The locking functions acquire a gate for exclusive access. If the gate 
cannot be immediately acquired, the calling thread waits for it. The 
conditional locking functions, which are prefixed with COND_ or cond_, 
acquire a gate only if a wait is not required. If the gate is acquired, the 
functions return 0; if not, they return -1. 

The forms of these locking functions are shown in Table 55. 

Table 55 Forms of locking functions 

Language   Form 
Fortran    INTEGER FUNCTION LOCK_GATE(gate) 
           INTEGER FUNCTION COND_LOCK_GATE(gate) 
C/C++      int lock_gate(gate_t *gate_p); 
           int cond_lock_gate(gate_t *gate_p); 

where (in Fortran) 

gate is a gate variable. 

where (in C/C++) 

gate_p is a pointer of the indicated type. 

Unlocking functions 

The unlocking functions release a gate from exclusive access. Gates are 
typically released by the thread that locks them, unless a gate was 
locked by thread 0 in serial code. In that case it might be unlocked by a 
single different thread in a parallel construct. 

The form of these unlocking functions is shown in Table 56. 

Table 56 Form of unlocking functions 

Language   Form 
Fortran    INTEGER FUNCTION UNLOCK_GATE(gate) 
C/C++      int unlock_gate(gate_t *gate_p); 

where (in Fortran) 

gate is a gate variable. 

where (in C/C++) 

gate_p is a pointer of the indicated type. 
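
The difference between the waiting and the conditional locking functions 
can be illustrated with a small portable analog. The sketch below is not 
the HP gate implementation: it assumes a C11 atomic_flag as the 
underlying lock, and the demo_ names are hypothetical. Only the 0 and -1 
results of the conditional lock come from the text above; the 0 returned 
by the other functions is an assumption made for the sketch. 

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a gate; NOT the HP gate_t type. */
typedef struct { atomic_flag held; } demo_gate_t;

/* A freshly initialized gate is unlocked, as after allocation. */
void demo_gate_init(demo_gate_t *g)
{
    atomic_flag_clear(&g->held);
}

/* Conditional lock: never waits. Returns 0 if acquired, -1 if not. */
int demo_cond_lock(demo_gate_t *g)
{
    return atomic_flag_test_and_set(&g->held) ? -1 : 0;
}

/* Unconditional lock: spins until the gate is acquired. */
int demo_lock(demo_gate_t *g)
{
    while (atomic_flag_test_and_set(&g->held)) {
        /* busy-wait; a production lock would more likely block */
    }
    return 0;
}

/* Release the gate so that another caller can acquire it. */
int demo_unlock(demo_gate_t *g)
{
    atomic_flag_clear(&g->held);
    return 0;
}
```

A caller that receives -1 from the conditional lock can do other useful 
work and retry later instead of stalling. 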

Wait functions 

The wait functions use a barrier to cause the calling thread to wait until 
the specified number of threads call the function. At this point all 
threads are released from the function simultaneously. 

The form of the wait functions is shown in Table 57. 

Table 57 Form of wait functions 

Language   Form 
Fortran    INTEGER FUNCTION WAIT_BARRIER(barrier, nthr) 
C/C++      int wait_barrier(barrier_t *barrier_p, const int *nthr); 

where (in Fortran) 

barrier is a barrier variable of the indicated type and nthr is 
the number of threads calling the routine. 

where (in C/C++) 

barrier_p is a pointer of the indicated type and nthr is a pointer 
referencing the number of threads calling the routine. 

You can use a barrier variable in multiple calls to the 
wait function, if you ensure that two such barriers are 
not simultaneously active. You must also verify that 
nthr reflects the correct number of threads. 

sync_routine 

Among the most basic optimizations performed by the HP compilers is 
code motion, which is described in "Standard optimization features," on 
page 35. This optimization moves code across routine calls. If the routine 
call is to a synchronization function that the compiler cannot identify as 
such, and the code moved must execute on a certain side of it, this 
movement may result in wrong answers. 

The compiler is aware of all synchronization functions and does not move 
code across them when they appear directly in code. However, if the 
synchronization function is hidden in a user-defined routine, the 
compiler has no way of knowing about it and may move code across it. 

Any time you call synchronization functions indirectly using your own 
routines, you must identify your routines with a sync_routine 
directive or pragma. 

The form of sync_routine is shown in Table 58. 
Table 58 Form of sync_routine directive and pragma 

Language   Form 
Fortran    C$DIR SYNC_ROUTINE(routinelist) 
C          #pragma _CNX sync_routine(routinelist) 

where 

routinelist is a comma-separated list of synchronization routines. 

sync_routine 

sync_routine is effective only for the listed routines that lexically 
follow it in the same file where it appears. The following Fortran code 
example features the sync_routine directive: 

      INTEGER MY_LOCK, MY_UNLOCK 
C$DIR GATE(LOCK) 
C$DIR SYNC_ROUTINE(MY_LOCK, MY_UNLOCK) 

      LCK = ALLOC_GATE(LOCK) 
C$DIR LOOP_PARALLEL 
      DO I = 1, N 
        LCK = MY_LOCK(LOCK) 

        SUM = SUM + A(I) 
        LCK = MY_UNLOCK(LOCK) 
      ENDDO 

      INTEGER FUNCTION MY_LOCK(LOCK) 
C$DIR GATE(LOCK) 
      LCK = LOCK_GATE(LOCK) 
      MY_LOCK = LCK 
      RETURN 
      END 

      INTEGER FUNCTION MY_UNLOCK(LOCK) 
C$DIR GATE(LOCK) 
      LCK = UNLOCK_GATE(LOCK) 
      MY_UNLOCK = LCK 
      RETURN 
      END 
In this example, my_lock and my_unlock are user functions that call 
the lock_gate and unlock_gate intrinsics. The sync_routine 
directive prevents the compiler from moving code across the calls to 
my_lock and my_unlock. 

Programming techniques such as this are used to implement portable 
code across several parallel architectures that support critical sections 
through different syntax. To port the code, my_lock and my_unlock 
could simply be modified to call the locking and unlocking functions of 
the target architecture. 

sync_routine 

The following C example achieves the same task as shown in the 
previous Fortran example: 

#include <spp_prog_model.h> 

main() { 
    int i, n, lck, sum, a[1000]; 
    gate_t lock; 
#pragma _CNX sync_routine(mylock, myunlock) 

    lck = alloc_gate(&lock); 
#pragma _CNX loop_parallel(ivar=i) 
    for (i=0; i<n; i++) { 
        lck = mylock(&lock); 

        sum = sum + a[i]; 
        lck = myunlock(&lock); 
    } 
} 

int mylock(gate_t *lock) { 
    int lck; 
    lck = lock_gate(lock); 
    return lck; 
} 

int myunlock(gate_t *lock) { 
    int lck; 
    lck = unlock_gate(lock); 
    return lck; 
} 
loop_parallel(ordered) 

The loop_parallel(ordered) directive and pragma are designed to be 
used with ordered sections to execute loops with ordered dependences in 
loop order. They accomplish this by parallelizing the loop so that 
consecutive iterations are initiated on separate processors, in loop order. 

While loop_parallel(ordered) guarantees starting order, it does not 
guarantee ending order, and it provides no automatic synchronization. 
To avoid wrong answers, you must manually synchronize dependences 
using the ordered section directives, pragmas, or the synchronization 
intrinsics (see "Critical sections" on page 254 of this chapter for more 
information). 

loop_parallel, ordered 

The following Fortran code shows how loop_parallel(ordered) is 
structured: 

C$DIR LOOP_PARALLEL(ORDERED) 
      DO I = 1, 100 
        . ! CODE CONTAINING ORDERED SECTION 
      ENDDO 

Assume that the body of this loop contains code that is parallelizable 
except for an ordered data dependence (otherwise there is no need to 
order the parallelization). Also assume that 8 threads, numbered 0..7, 
are available to run the loop in parallel. Each thread would then execute 
code equivalent to the following: 

DO I = (my_thread()+1), 100, num_threads() 

ENDDO 

Figure 17 illustrates this assumption. 
Figure 17 Ordered parallelization 

THREAD 0   DO I = 1,100,8 ... ENDDO 
THREAD 1   DO I = 2,100,8 ... ENDDO 
THREAD 2   DO I = 3,100,8 ... ENDDO 
THREAD 3   DO I = 4,100,8 ... ENDDO 
THREAD 4   DO I = 5,100,8 ... ENDDO 
THREAD 5   DO I = 6,100,8 ... ENDDO 
THREAD 6   DO I = 7,100,8 ... ENDDO 
THREAD 7   DO I = 8,100,8 ... ENDDO 

Here, thread 0 executes first, followed by thread 1, and so on. Each 
thread starts its iteration after the preceding iteration has started. A 
manually defined ordered section prevents one thread from executing 
the code in the ordered section until the previous thread exits the 
section. This means that thread 0 cannot enter the section for iteration 9 
until thread 7 exits it for iteration 8. 
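
The assignment of iterations to threads described above is a cyclic 
distribution, which can be written down in a few lines of C. The function 
names below are hypothetical and serve only to make the mapping concrete; 
they assume 1-based iteration numbers and 0-based thread IDs, as in the 
Fortran loop. 

```c
/* Thread that executes iteration i (1-based) when nthreads threads each
   run the equivalent of: DO I = my_thread()+1, last, num_threads() */
int owner_of_iteration(int i, int nthreads)
{
    return (i - 1) % nthreads;
}

/* Number of iterations in 1..last that thread tid executes. */
int iterations_for_thread(int tid, int last, int nthreads)
{
    int count = 0;

    for (int i = tid + 1; i <= last; i += nthreads)
        count++;
    return count;
}
```

With 100 iterations and 8 threads, threads 0 through 3 each execute 13 
iterations and threads 4 through 7 each execute 12. 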

This is efficient only if the loop body contains enough code to keep a 
thread busy until all other threads start their consecutive iterations, 
thus taking advantage of parallelism. 

You may find the max_threads attribute helpful when fine-tuning 
loop_parallel(ordered) loops to fully exploit their parallel code. 

Examples of synchronizing loop_parallel(ordered) loops are shown 
in "Synchronizing code" on page 257. 

Critical sections 

Critical sections allow you to synchronize simple, nonordered 
dependences. You must use the critical_section directive or pragma 
to enter a critical section, and the end_critical_section directive or 
pragma to exit one. 

Critical sections must not contain branches to outside the section. The 
two directives must appear in the same procedure, but they do not have 
to be in the same procedure as the parallel construct in which they are 
used. This means that the directives can exist in a procedure that is 
called in parallel. 

The forms of these directives and pragmas are shown in Table 59. 
Table 59 Forms of critical_section, end_critical_section directives 
and pragmas 

Language   Form 
Fortran    C$DIR CRITICAL_SECTION [(gate)] 
           C$DIR END_CRITICAL_SECTION 
C          #pragma _CNX critical_section [(gate)] 
           #pragma _CNX end_critical_section 


where 

gate is an optional gate variable used for access to the 
critical section. gate must be appropriately declared as 
described in "Using gates and barriers" on 
page 245. 

The gate variable is required when synchronizing access to a shared 
variable from multiple parallel tasks. 

• When a gate variable is specified, it must be allocated (using the 
alloc_gate intrinsic) outside of parallel code prior to use. 

• If no gate is specified, the compiler creates a unique gate for the 
critical section. 

• When a gate is no longer needed, it should be deallocated using the 
free_gate function. 

NOTE 

Critical sections add synchronization overhead to your program. They 
should only be used when the amount of parallel code is significantly larger 
than the amount of code containing the dependence. 


Ordered sections 

Ordered sections allow you to synchronize dependences that must 
execute in iteration order. The ordered_section and 
end_ordered_section directives and pragmas are used to specify 
critical sections within manually defined, ordered loop_parallel loops 
only. 

The forms of these directives and pragmas are shown in Table 60. 
Table 60 Forms of ordered_section, end_ordered_section directives and 
pragmas 

Language   Form 
Fortran    C$DIR ORDERED_SECTION (gate) 
           C$DIR END_ORDERED_SECTION 
C          #pragma _CNX ordered_section (gate) 
           #pragma _CNX end_ordered_section 



where 

gate is a required gate variable that must be allocated and, 
if necessary, unlocked prior to invocation of the parallel 
loop containing the ordered section. gate must be 
appropriately declared as described in the "Using gates 
and barriers" section of this chapter. 

Ordered sections must be entered through ordered_section and 
exited through end_ordered_section. They cannot contain branches 
to outside the section. Ordered sections are subject to the same control 
flow rules as critical sections. 

NOTE 

As with critical sections, ordered sections should be used with care, as they 
add synchronization overhead to your program. They should only be used 
when the amount of parallel code is significantly larger than the amount of 
code containing the dependence. 
Synchronizing code 

Code containing dependences is parallelized by synchronizing the way 
the parallel tasks access the dependence. This is done manually using 
the gates, barriers and synchronization functions discussed earlier in 
this chapter, or semiautomatically using critical and ordered sections, 
described in the following sections. 


Using critical sections 

The critical_section example on page 189 isolates a single critical 
section in a loop, so that the critical_section directive does not 
require a gate. In this case, the critical section directives automate 
allocation, locking, unlocking and deallocation of the needed gate. 
Multiple dependences and dependences in manually-defined parallel 
tasks are handled when user-defined gates are used with the directives. 

critical sections 

The following Fortran example, however, uses the manual methods of 
code synchronization: 

      REAL GLOBAL_SUM 
C$DIR FAR_SHARED(GLOBAL_SUM) 
C$DIR GATE(SUM_GATE) 

      LOCK = ALLOC_GATE(SUM_GATE) 
C$DIR BEGIN_TASKS 
      CONTRIB1 = 0.0 
      DO J = 1, M 
        CONTRIB1 = CONTRIB1 + FUNC1(J) 
      ENDDO 

C$DIR CRITICAL_SECTION (SUM_GATE) 
      GLOBAL_SUM = GLOBAL_SUM + CONTRIB1 
C$DIR END_CRITICAL_SECTION 

C$DIR NEXT_TASK 
      CONTRIB2 = 0.0 
      DO I = 1, N 
        CONTRIB2 = CONTRIB2 + FUNC2(I) 
      ENDDO 

C$DIR CRITICAL_SECTION (SUM_GATE) 
      GLOBAL_SUM = GLOBAL_SUM + CONTRIB2 
C$DIR END_CRITICAL_SECTION 

C$DIR END_TASKS 
      LOCK = FREE_GATE(SUM_GATE) 

Here, both parallel tasks must access the shared global_sum variable. 
To ensure that global_sum is updated by only one task at a time, it is 
placed in a critical section. The critical sections both reference the 
sum_gate variable. This variable is unlocked on entry into the parallel 
code (gates are always unlocked when they are allocated). 

When one task reaches the critical section, the critical_section 
directive automatically locks sum_gate. The end_critical_section 
directive unlocks sum_gate on exit from the section. Because access to 
both critical sections is controlled by a single gate, the sections must 
execute one at a time. 
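
For readers without access to the HP directives, the same pattern can be 
approximated with POSIX threads. The sketch below is an analog under 
stated assumptions, not a translation of the directives: a 
pthread_mutex_t stands in for the gate, two threads stand in for the two 
tasks, and the loop body that sums integers is a placeholder for the 
FUNC1 and FUNC2 calls. The name run_tasks is hypothetical. 

```c
#include <pthread.h>
#include <stddef.h>

static double global_sum;              /* shared by both tasks */
static pthread_mutex_t sum_gate = PTHREAD_MUTEX_INITIALIZER;

/* Each task accumulates privately, then updates the shared sum once. */
static void *task(void *arg)
{
    int n = *(int *)arg;
    double contrib = 0.0;              /* private: on this thread's stack */

    for (int j = 1; j <= n; j++)
        contrib += (double)j;          /* placeholder for FUNC1 or FUNC2 */

    pthread_mutex_lock(&sum_gate);     /* enter the critical section */
    global_sum += contrib;
    pthread_mutex_unlock(&sum_gate);   /* leave the critical section */
    return NULL;
}

/* Runs two tasks in parallel and returns the combined sum. */
double run_tasks(int m, int n)
{
    pthread_t t1, t2;

    global_sum = 0.0;
    pthread_create(&t1, NULL, task, &m);
    pthread_create(&t2, NULL, task, &n);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return global_sum;
}
```

Because both updates go through the same mutex, the result does not 
depend on which task reaches its critical section first. 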

Gated critical sections 

Gated critical sections are also useful in loops containing multiple 
critical sections when there are dependences between the critical 
sections. If no dependences exist between the sections, gates are not 
needed. The compiler automatically supplies a unique gate for every 
critical section lacking a gate. 

The C example below uses gates so that threads do not update at the 
same time, within a critical section: 

static far_shared float absum; 
static gate_t gate1; 
int adjb[...]; 

lock = alloc_gate(&gate1); 
#pragma _CNX loop_parallel(ivar=i) 
for (i=0; i<n; i++) { 
    a[i] = b[i] + c[i]; 
#pragma _CNX critical_section(gate1) 
    absum = absum + a[i]; 
#pragma _CNX end_critical_section 
    if (adjb[i]) { 
        b[i] = c[i] + d[i]; 
#pragma _CNX critical_section(gate1) 
        absum = absum + b[i]; 
#pragma _CNX end_critical_section 
    } 

} 
lock = free_gate(&gate1); 

The shared variable absum must be updated after a[i] is assigned and
again if b[i] is assigned. Access to absum must be guarded by the same
gate to ensure that two threads do not attempt to update it at once. The
critical sections protecting the assignment to absum must explicitly
name this gate. Otherwise, the compiler chooses a unique gate for each
section, potentially resulting in incorrect answers. There must be a
substantial amount of parallelizable code outside of these critical
sections to make parallelizing this loop cost-effective.

Using ordered sections 

Like critical sections, ordered sections lock and unlock a specified gate to 
isolate a section of code in a loop. However, they also ensure that the 
enclosed section of code executes in the same order as the iterations of 
the ordered parallel loop that contains it. 

Once a given thread passes through an ordered section, it cannot enter 
again until all other threads have passed through in order. This ordering 
is difficult to implement without using the ordered section directives or 
pragmas. 

You must use a loop_parallel(ordered) directive or pragma to
parallelize any loop containing an ordered section. See
"loop_parallel(ordered)" on page 253 for a description of this.

Ordered sections 

The following Fortran example contains a backward loop-carried 
dependence on the array a that would normally inhibit parallelization. 

      DO I = 2, N
        .             ! PARALLELIZABLE CODE...
        A(I) = A(I-1) + B(I)
        .             ! MORE PARALLELIZABLE CODE...
      ENDDO

Assuming that the dependence shown is the only one in the loop, and 
that a significant amount of parallel code exists elsewhere in the loop, 
the dependence is isolated. The loop is parallelized as shown below: 

C$DIR GATE(LCD)
      LOCK = ALLOC_GATE(LCD)

      LOCK = UNLOCK_GATE(LCD)
C$DIR LOOP_PARALLEL(ORDERED)
      DO I = 2, N
        .             ! PARALLELIZABLE CODE...
C$DIR ORDERED_SECTION(LCD)
        A(I) = A(I-1) + B(I)
C$DIR END_ORDERED_SECTION
        .             ! MORE PARALLELIZABLE CODE...
      ENDDO
      LOCK = FREE_GATE(LCD)

The ordered section containing the a(i) assignment executes in
iteration order. This ensures that the value of a(i-1) used in the
assignment is always valid. Assuming this loop runs on four threads, the
synchronization of statement execution between threads is illustrated in
Figure 18.

Figure 18   Order of statement execution

(Legend: statements contained within ordered sections; nonordered
section statements.)


As shown by the dashed lines between initial iterations for each thread,
one ordered section must be completed before the next is allowed to begin
execution. Once a thread exits an ordered section, it cannot reenter until
all other threads have passed through in sequence.

Overlap of nonordered statements, represented as lightly shaded boxes,
allows all threads to proceed fully loaded. Only brief idle periods occur on
threads 1, 2, and 3 at the beginning of the loop, and on threads 0, 1, and 2
at the end.

Ordered section limitations 

Each thread in a parallel loop containing an ordered section must pass 
through the ordered section exactly once on every iteration of the loop. If 
you execute an ordered section conditionally, you must execute it in all 
possible branches of the condition. If the code contained in the section is 
not valid for some branches, you can insert a blank ordered section, as 
shown in the following Fortran example: 

C$DIR GATE (LCD)

      LOCK = ALLOC_GATE(LCD)
C$DIR LOOP_PARALLEL(ORDERED)
      DO I = 1, N

        IF (Z(I) .GT. 0.0) THEN
C$DIR ORDERED_SECTION(LCD)
C         HERE'S THE BACKWARD LCD:
          A(I) = A(I-1) + B(I)
C$DIR END_ORDERED_SECTION
        ELSE
C         HERE IS THE BLANK ORDERED SECTION:
C$DIR ORDERED_SECTION(LCD)
C$DIR END_ORDERED_SECTION
        ENDIF

      ENDDO
      LOCK = FREE_GATE(LCD)

No matter which path through the if statement the loop takes, and
even though the else section is empty, it must pass through the ordered
section. This allows the compiler to properly synchronize the ordered
loop. It is assumed that a substantial amount of parallel code exists
outside the ordered sections, to offset the synchronization overhead.

Ordered section limitations 

Ordered sections within nested loops can create similar, but more
difficult to recognize, problems. Consider the following Fortran example
(gate manipulation is omitted for brevity):

C$DIR LOOP_PARALLEL(ORDERED)
      DO I = 1, 99
        DO J = 1, M

C$DIR ORDERED_SECTION(ORDGATE)
          A(I,J) = A(I+1,J)
C$DIR END_ORDERED_SECTION

        ENDDO
      ENDDO

Recall that once a given thread has passed through an ordered section, it 
cannot reenter it until all other threads have passed through in order. 
This is only possible in the given example if the number of available 
threads integrally divides 99 (the i loop limit). If not, deadlock results. 

To better understand this: 

• Assume 6 threads, numbered 0 through 5, are running the parallel i
loop.

• For i=1, j=1, thread 0 passes through the ordered section and loops
back through j, stopping when it reaches the ordered section again
for i=1, j=2. It cannot enter until threads 1 through 5 (which are
executing i=2 through 6, j=1 respectively) pass through in
sequence. This is not a problem, and the loop proceeds through i=96
in this fashion in parallel.

• For i>96, all 6 threads are no longer needed. In a single loop nest
this would not pose a problem as the leftover 3 iterations would be
handled by threads 0 through 2. When thread 2 exited the ordered
section it would hit the enddo and the i loop would terminate
normally.

• But in this example, the j loop isolates the ordered section from the i
loop, so thread 0 executes j=1 for i=97, loops through j and waits
during j=2 at the ordered section for thread 5, which has gone idle,
to complete. Threads 1 and 2 similarly execute j=1 for i=98 and
i=99, and similarly wait after incrementing j to 2. The entire j loop
must terminate before the i loop can terminate, but the j loop can
never terminate because the idle threads 3, 4, and 5 never pass
through the ordered section. As a result, deadlock occurs.
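
The arithmetic behind this scenario can be sketched in C. This is only a model of the reasoning in the list above, assuming iterations are handed to threads in rounds of one iteration per thread; the function names are hypothetical and are not part of any HP library.

```c
#include <stdbool.h>

/* Number of threads that have no iteration in the final round when
   `iterations` iterations are dealt out one per thread per round. */
int idle_threads_in_last_round(int iterations, int threads)
{
    int leftover = iterations % threads;   /* iterations in the last round */
    return leftover == 0 ? 0 : threads - leftover;
}

/* In the nested case described above, any idle thread in the last round
   never passes through the ordered section, so the waiting threads
   cannot proceed. */
bool nested_ordered_section_can_deadlock(int iterations, int threads)
{
    return idle_threads_in_last_round(iterations, threads) > 0;
}
```

With 99 iterations and 6 threads the model reports 3 idle threads (threads 3, 4, and 5 in the text), while 3, 9, or 11 threads divide 99 evenly and leave none idle.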

To handle this problem, you can expand the ordered section to include
the entire j loop, as shown in the following C example:

#pragma _CNX loop_parallel(ordered,ivar=i)
for (i=0; i<99; i++) {
#pragma _CNX ordered_section(ordgate)
  for (j=0; j<m; j++) {

    a[i][j] = a[i+1][j];

  }
#pragma _CNX end_ordered_section
}

In this approach, each thread executes the entire j loop each time it
enters the ordered section, allowing the i loop to terminate normally
regardless of the number of threads available.

Another approach is to manually interchange the i and j loops, as
shown in the following Fortran example:

      DO J = 1, M
        LOCK = UNLOCK_GATE(ORDGATE)
C$DIR LOOP_PARALLEL(ORDERED)
        DO I = 1, 99

C$DIR ORDERED_SECTION(ORDGATE)
          A(I,J) = A(I+1,J)
C$DIR END_ORDERED_SECTION

        ENDDO
      ENDDO

Here, the i loop is parallelized on every iteration of the j loop. The
ordered section is not isolated from its parent loop, so the loop can
terminate normally. This example has an added benefit: elements of a are
accessed more efficiently.

Manual synchronization 

Ordered and critical sections allow you to isolate dependences in a
structured, semiautomatic manner. The same isolation can be accomplished
manually using the functions discussed in "Synchronization functions"
on page 246.

Critical sections and gates 

Below is a simple critical section Fortran example using
loop_parallel:

C$DIR LOOP_PARALLEL
      DO I = 1, N     ! LOOP IS PARALLELIZABLE

C$DIR CRITICAL_SECTION
        SUM = SUM + X(I)
C$DIR END_CRITICAL_SECTION

      ENDDO

As shown, this example is easily implemented using critical sections. It is 
manually implemented in Fortran, using gate functions, as shown below: 

C$DIR GATE(CRITSEC)
      LOCK = ALLOC_GATE(CRITSEC)
C$DIR LOOP_PARALLEL
      DO I = 1, N

        LOCK = LOCK_GATE(CRITSEC)
        SUM = SUM + X(I)
        LOCK = UNLOCK_GATE(CRITSEC)

      ENDDO
      LOCK = FREE_GATE(CRITSEC)

As shown, the manual implementation requires declaring, allocating,
and deallocating a gate, which must be locked on entry into the critical
section using the LOCK_GATE function and unlocked on exit using
UNLOCK_GATE.

Conditionally lock critical sections 

Another advantage of manually defined critical sections is the ability to
conditionally lock them. This allows the task that wishes to execute the
section to proceed with other work if the lock cannot be acquired. This
construct is useful, for example, in situations where one thread is
performing I/O for several other parallel threads.

While a processing thread is reading from the input queue, the queue is
locked, and the I/O thread can move on to do output. While a processing
thread is writing to the output queue, the I/O thread can do input. This
allows the I/O thread to keep as busy as possible while the parallel
computational threads execute their (presumably large) computational
code.
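
The idea of a conditional lock can be sketched in portable C11. This is not the HP gate implementation; gate_sketch_t and the gate_* functions are hypothetical names built on the standard atomic_flag type, and the "0 means acquired" return convention is an assumption chosen to mirror the comparison used in the Fortran example that follows.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a gate: one atomic flag, set while held. */
typedef struct { atomic_flag held; } gate_sketch_t;

/* A freshly initialized gate is unlocked. */
void gate_init(gate_sketch_t *g) { atomic_flag_clear(&g->held); }

/* Try to take the gate without waiting.
   Returns 0 if the caller now holds the gate, nonzero if it was busy. */
int gate_cond_lock(gate_sketch_t *g)
{
    /* test_and_set returns the previous state: true means already held. */
    return atomic_flag_test_and_set(&g->held) ? 1 : 0;
}

/* Release the gate so another thread can acquire it. */
void gate_unlock(gate_sketch_t *g) { atomic_flag_clear(&g->held); }
```

A caller that receives a nonzero result simply skips the protected section and does other work, then tries again on its next pass.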

This situation is illustrated in the following Fortran example. Task 1 
performs I/O for the 7 other tasks, which perform parallel computations 
by calling the thread_wrk subroutine: 

      COMMON INGATE,OUTGATE,COMPBAR
C$DIR GATE (INGATE, OUTGATE)
C$DIR BARRIER (COMPBAR)
      REAL DIN(:), DOUT(:)      ! I/O BUFFERS FOR TASK 1
      ALLOCATABLE DIN, DOUT     ! THREAD 0 WILL ALLOCATE
      REAL QIN(1000,1000), QOUT(1000,1000) ! SHARED I/O QUEUES
      INTEGER NIN/0/,NOUT/0/    ! QUEUE ENTRY COUNTERS
C     CIRCULAR BUFFER POINTERS:
      INTEGER IN_QIN/1/,OUT_QIN/1/,IN_QOUT/1/,OUT_QOUT/1/
      COMMON /DONE/ DONEIN, DONECOMP
      LOGICAL DONECOMP, DONEIN
C     SIGNALS FOR COMPUTATION DONE AND INPUT DONE
      LOGICAL COMPDONE, INDONE
C     FUNCTIONS TO RETURN DONECOMP AND DONEIN
      LOGICAL INFLAG, OUTFLAG   ! INPUT READ AND OUTPUT WRITE FLAGS
C$DIR THREAD_PRIVATE (INFLAG,OUTFLAG) ! ONLY NEEDED BY TASK 1
C                                       (WHICH RUNS ON THREAD 0)
      IF (NUM_THREADS() .LT. 8) STOP 1
      IN = 10
      OUT = 11
      LOCK = ALLOC_GATE(INGATE)
      LOCK = ALLOC_GATE(OUTGATE)
      IBAR = ALLOC_BARRIER(COMPBAR)
      DONECOMP = .FALSE.
C$DIR BEGIN_TASKS               ! TASK 1 STARTS HERE
      INFLAG = .TRUE.
      DONEIN = .FALSE.
      ALLOCATE(DIN(1000),DOUT(1000)) ! ALLOCATE LOCAL BUFFERS
      DO WHILE(.NOT. INDONE() .OR. .NOT. COMPDONE() .OR.
     >         NOUT .GT. 0)
C       DO TILL EOF AND COMPUTATION DONE AND OUTPUT DONE
        IF (NIN.LT.1000 .AND. (.NOT.COMPDONE()) .AND.
     >      (.NOT.INDONE())) THEN
C         FILL QUEUE
          IF (INFLAG) THEN      ! FILL BUFFER FIRST:
            READ(IN, IOSTAT = IOS) DIN ! READ A RECORD; QUIT ON EOF
            IF (IOS .EQ. -1) THEN
              DONEIN = .TRUE.   ! SIGNAL THAT INPUT IS DONE
              INFLAG = .TRUE.
            ELSE
              INFLAG = .FALSE.
            ENDIF
          ENDIF
C         SYNCHRONOUSLY ENTER INTO INPUT QUEUE:
C         BLOCK QUEUE ACCESS WITH INGATE:
          IF (COND_LOCK_GATE(INGATE) .EQ. 0 .AND.
     >        .NOT. INDONE()) THEN
            QIN(:,IN_QIN) = DIN(:) ! COPY INPUT BUFFER INTO QIN
            IN_QIN = 1+MOD(IN_QIN,1000) ! INCREMENT INPUT BUFFER PTR
            NIN = NIN + 1       ! INCREMENT INPUT QUEUE ENTRY COUNTER
            INFLAG = .TRUE.
            LOCK = UNLOCK_GATE(INGATE) ! ALLOW INPUT QUEUE ACCESS
          ENDIF
        ENDIF
C       SYNCHRONOUSLY REMOVE FROM OUTPUT QUEUE:
C       BLOCK QUEUE ACCESS WITH OUTGATE:
        IF (COND_LOCK_GATE(OUTGATE) .EQ. 0) THEN
          IF (NOUT .GT. 0) THEN
            DOUT(:) = QOUT(:,OUT_QOUT) ! COPY OUTPUT QUE INTO BUFFR
            OUT_QOUT = 1+MOD(OUT_QOUT,1000)
C           INCREMENT OUTPUT BUFR PTR
            NOUT = NOUT - 1     ! DECREMENT OUTPUT QUEUE ENTRY COUNTR
            OUTFLAG = .TRUE.
          ELSE
            OUTFLAG = .FALSE.
          ENDIF
          LOCK = UNLOCK_GATE(OUTGATE)
C         ALLOW OUTPUT QUEUE ACCESS
          IF (OUTFLAG) WRITE(OUT) DOUT ! WRITE A RECORD
        ENDIF
      ENDDO
C     TASK 1 ENDS HERE
C$DIR NEXT_TASK                 ! TASK 2:
      CALL THREAD_WRK(NIN,NOUT,QIN,QOUT,IN_QIN,OUT_QIN,
     >                IN_QOUT,OUT_QOUT)
      IBAR = WAIT_BARRIER(COMPBAR,7)
C$DIR NEXT_TASK                 ! TASK 3:
      CALL THREAD_WRK(NIN,NOUT,QIN,QOUT,IN_QIN,OUT_QIN,
     >                IN_QOUT,OUT_QOUT)
      IBAR = WAIT_BARRIER(COMPBAR,7)
C$DIR NEXT_TASK                 ! TASK 4:
      CALL THREAD_WRK(NIN,NOUT,QIN,QOUT,IN_QIN,OUT_QIN,
     >                IN_QOUT,OUT_QOUT)
      IBAR = WAIT_BARRIER(COMPBAR,7)
C$DIR NEXT_TASK                 ! TASK 5:
      CALL THREAD_WRK(NIN,NOUT,QIN,QOUT,IN_QIN,OUT_QIN,
     >                IN_QOUT,OUT_QOUT)
      IBAR = WAIT_BARRIER(COMPBAR,7)
C$DIR NEXT_TASK                 ! TASK 6:
      CALL THREAD_WRK(NIN,NOUT,QIN,QOUT,IN_QIN,OUT_QIN,
     >                IN_QOUT,OUT_QOUT)
      IBAR = WAIT_BARRIER(COMPBAR,7)
C$DIR NEXT_TASK                 ! TASK 7:
      CALL THREAD_WRK(NIN,NOUT,QIN,QOUT,IN_QIN,OUT_QIN,
     >                IN_QOUT,OUT_QOUT)
      IBAR = WAIT_BARRIER(COMPBAR,7)
C$DIR NEXT_TASK                 ! TASK 8:
      CALL THREAD_WRK(NIN,NOUT,QIN,QOUT,IN_QIN,OUT_QIN,
     >                IN_QOUT,OUT_QOUT)
      IBAR = WAIT_BARRIER(COMPBAR,7)
      DONECOMP = .TRUE.
C$DIR END_TASKS
      END

Before looking at the thread_wrk subroutine it is necessary to examine
these parallel tasks, particularly task 1, the I/O server. Task 1 performs
all the I/O required by all the tasks:

• Conditionally locked gates control task 1's access to one section of
code that fills the input queue and one that empties the output queue.

• Task 1 works by first filling an input buffer. The code that does this
does not require gate protection because no other tasks attempt to
access the input buffer array.

• The section of code where the input buffer is copied into the input
queue, however, must be protected by gates to prevent any threads
from trying to read the input queue while it is being filled.

The other seven tasks perform computational work, receiving their input
from and sending their output to task 1's queues. If a task acquires a lock
on the input queue, task 1 cannot fill it until the task is done reading
from it.

• When task 1 cannot get a lock to access the input queue code, it tries 
to lock the output queue code. 

• If it gets a lock here, it can copy the output queue into the output 
buffer array and relinquish the lock. It can then proceed to empty the 
output buffer. 

• If another task is writing to the output queue, task 1 loops back and 
begins the entire process over again. 

• When the end of the input file is reached, all computation is complete,
and the output queue is empty, task 1 is finished.

The task loops on donein (using indone()), which is initially false. When
input is exhausted, donein is set to true, signalling all tasks that there is no
more input.

The indone() function references donein, forcing a memory reference.
If donein were referenced directly, the compiler might optimize it into a
register and consequently not detect a change in its value.
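
A rough C analogue of this idea is shown below. It is an illustration only, not the mechanism the Fortran example uses: the names done_in, signal_input_done, and in_done are hypothetical, and the volatile qualifier is used here solely to keep the compiler from caching the flag in a register. In ISO C, volatile by itself is not a thread synchronization primitive, so production code would normally use atomics or a lock for a flag shared between threads.

```c
/* Shared "input exhausted" flag. volatile tells the compiler that every
   read must go to memory rather than reuse a value held in a register. */
static volatile int done_in = 0;

/* Called by the I/O task when end of file is reached. */
void signal_input_done(void) { done_in = 1; }

/* Called by polling loops; each call performs a fresh read of the flag. */
int in_done(void) { return done_in; }
```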

This means that task 1 has four main jobs to do:

1 Read input into input buffer: no other tasks access the input buffer.
This is done in parallel regardless of what other tasks are doing, as
long as the buffer needs filling.

2 Copy input buffer into input queue: the other tasks read their input
from the input queue, therefore it can only be filled when no
computational task is reading it. This section of code is protected by
the ingate gate. It can run in parallel with the computational
portions of other tasks, but only one task can access the input queue
at a time.

3 Copy output queue into output buffer: the output queue is where
other tasks write their output. It can only be emptied when no
computational task is writing to it. This section of code is protected by
the outgate gate. It can run in parallel with the computational
portions of other tasks, but only one task can access the output queue
at a time.

4 Write out output buffer: no other tasks access the output buffer. This
is done in parallel regardless of what the other tasks are doing.

Next, it is important to look at the subroutine thread_wrk, which tasks
2 through 8 call to perform computations.

      SUBROUTINE THREAD_WRK(NIN,NOUT,QIN,QOUT,IN_QIN,OUT_QIN,
     >                      IN_QOUT,OUT_QOUT)
      INTEGER NIN,NOUT
      REAL QIN(1000,1000), QOUT(1000,1000) ! SHARED I/O QUEUES
      INTEGER OUT_QIN, OUT_QOUT
      COMMON INGATE,OUTGATE,COMPBAR
C$DIR GATE(INGATE, OUTGATE)
      REAL WORK(1000)     ! LOCAL THREAD PRIVATE WORK ARRAY
      LOGICAL OUTFLAG, INDONE
      OUTFLAG = .FALSE.
C$DIR THREAD_PRIVATE (WORK)     ! EVERY THREAD WILL CREATE A COPY
      DO WHILE(.NOT. INDONE() .OR. NIN.GT.0 .OR. OUTFLAG)
C       WORK/QOUT EMPTYING LOOP
        IF (.NOT. OUTFLAG) THEN     ! IF NO PENDING OUTPUT
C$DIR CRITICAL_SECTION (INGATE)     ! BLOCK ACCESS TO INPUT QUE
          IF (NIN .GT. 0) THEN      ! MORE WORK TO DO
            WORK(:) = QIN(:,OUT_QIN)
            OUT_QIN = 1 + MOD(OUT_QIN, 1000)
            NIN = NIN - 1
            OUTFLAG = .TRUE.
C           INDICATE THAT INPUT DATA HAS BEEN RECEIVED
          ENDIF
C$DIR END_CRITICAL_SECTION
          ! SIGNIFICANT PARALLEL CODE HERE USING WORK ARRAY
        ENDIF
        IF (OUTFLAG) THEN   ! IF PENDING OUTPUT, MOVE TO OUTPUT QUEUE
C       AFTER INPUT QUEUE IS USED IN COMPUTATION, FILL OUTPUT QUEUE:
C$DIR CRITICAL_SECTION (OUTGATE)    ! BLOCK ACCESS TO OUTPUT QUEUE
          IF (NOUT.LT.1000) THEN
C           IF THERE IS ROOM IN THE OUTPUT QUEUE
            QOUT(:,IN_QOUT) = WORK(:) ! COPY WORK INTO OUTPUT QUEUE
            IN_QOUT = 1+MOD(IN_QOUT,1000) ! INCREMENT BUFFER PTR
            NOUT = NOUT + 1   ! INCREMENT OUTPUT QUEUE ENTRY COUNTER
            OUTFLAG = .FALSE. ! INDICATE NO OUTPUT PENDING
          ENDIF
C$DIR END_CRITICAL_SECTION
        ENDIF
      ENDDO               ! END WORK/QOUT EMPTYING LOOP
      END                 ! END THREAD_WRK

      LOGICAL FUNCTION INDONE()
C     THIS FUNCTION FORCES A MEMORY REFERENCE TO GET THE DONEIN VALUE
      LOGICAL DONEIN
      COMMON /DONE/ DONEIN, DONECOMP
      INDONE = DONEIN
      END

      LOGICAL FUNCTION COMPDONE()
C     THIS FUNCTION FORCES A MEMORY REFERENCE TO GET THE DONECOMP VALUE
      LOGICAL DONECOMP
      COMMON /DONE/ DONEIN, DONECOMP
      COMPDONE = DONECOMP
      END

Notice that the gates are accessed through common blocks. Each thread 
that calls this subroutine allocates a thread_private work array. 

This subroutine contains a loop that tests indone().

• The loop copies the input queue into the local work array, then does a
significant amount of computational work that has been omitted for
simplicity.

The computational work is the main code that executes in parallel. If there
is not a large amount of it, the overhead of setting up these parallel tasks
and critical sections cannot be justified.

• The loop encompasses this computation, and also the section of code 
that copies the work array to the output queue. 

• This construct allows final output to be written after all input has 
been used in computation. 

• To avoid accessing the input queue while it is being filled or accessed 
by another thread, the section of code that copies it into the local 
work array is protected by a critical section. 

This section must be unconditionally locked as the computational threads 
cannot do something else until they receive their input. 

Once the input queue has been copied, thread_wrk can perform its
large section of computational code in parallel with whatever the other
tasks are doing. After the computational section is finished, another
unconditional critical section must be entered so that the results are
written to the output queue. This prevents two threads from accessing
the output queue at once.
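
The queue bookkeeping used by both the I/O task and thread_wrk (a 1-based circular pointer advanced with 1+MOD(ptr,size), plus an entry counter) can be sketched in C as follows. The ring_queue_t type, the rq_* functions, and the small capacity QCAP are hypothetical names chosen for illustration. No locking is shown: as in the Fortran example, the caller is assumed to hold the appropriate gate around every rq_put or rq_get call.

```c
#define QCAP 4   /* illustrative capacity; the Fortran example uses 1000 */

typedef struct {
    float slot[QCAP];
    int in;      /* next slot to fill, 1-based like IN_QIN  */
    int out;     /* next slot to drain, 1-based like OUT_QIN */
    int count;   /* number of entries, like NIN              */
} ring_queue_t;

void rq_init(ring_queue_t *q) { q->in = 1; q->out = 1; q->count = 0; }

/* Same arithmetic as 1+MOD(PTR,1000): wraps QCAP back to 1. */
int rq_next(int ptr) { return 1 + ptr % QCAP; }

/* Returns 1 if the value was queued, 0 if the queue is full. */
int rq_put(ring_queue_t *q, float v)
{
    if (q->count >= QCAP) return 0;
    q->slot[q->in - 1] = v;
    q->in = rq_next(q->in);
    q->count++;
    return 1;
}

/* Returns 1 and stores the oldest entry in *v, or 0 if the queue is empty. */
int rq_get(ring_queue_t *q, float *v)
{
    if (q->count <= 0) return 0;
    *v = q->slot[q->out - 1];
    q->out = rq_next(q->out);
    q->count--;
    return 1;
}
```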

Problems like this require performance testing and tuning to achieve
optimal parallel efficiency. Variables such as the number of
computational threads and the size of the I/O queues can be adjusted to
yield the best processor utilization.

This chapter discusses common optimization problems that occasionally
occur when developing programs for SMP servers. Possible solutions to
these problems are offered where applicable.

Optimization can remove instructions, replace them, and change the
order in which they execute. In some cases, improper optimizations can
cause unexpected or incorrect results or code that slows down at higher
optimization levels. In other cases, user error can cause similar problems
in code that contains improperly used syntactically correct constructs or
directives. If you encounter any of these problems, look for the following
possible causes:

• Aliasing 

• False cache line sharing 

• Floating-point imprecision 

• Invalid subscripts

• Misused directives and pragmas 

• Triangular loops 

• Compiler assumptions 

Compilers perform optimizations assuming that the source code being
compiled is valid. Optimizations done on source that violates certain ANSI
standard rules can cause the compilers to generate incorrect code.

Aliasing

As described in the section "Inhibiting parallelization" on page 105, an
alias is an alternate name for an object. Fortran equivalence
statements, C pointers, and procedure calls in both languages can
potentially cause aliasing problems. Problems can and do occur at
optimization levels +O3 and above. However, code motion can also cause
aliasing problems at optimization levels +O1 and above.

Because they frequently use pointers, C programs are especially 
susceptible to aliasing problems. By default, the optimizer assumes that 
a pointer can point to any object in the entire application. Thus, any two 
pointers are potential aliases. The C compiler has two algorithms you 
can specify in place of the default: an ANSI-C aliasing algorithm and a 
type-safe algorithm. 

The ANSI-C algorithm is enabled [disabled] through the
+O[no]ptrs_ansi option.

The type-safe algorithm is enabled [disabled] by specifying the
command-line option +O[no]ptrs_strongly_typed.

The defaults for these options are +Onoptrs_ansi and
+Onoptrs_strongly_typed.

ANSI algorithm 

ANSI C provides strict type-checking. Pointers and variables cannot 
alias with pointers or variables of a different base type. The ANSI C 
aliasing algorithm may not be safe if your program is not ANSI 
compliant. 

Type-safe algorithm 

The type-safe algorithm provides stricter type-checking. This allows the 
C compiler to use a stricter algorithm that eliminates many potential 
aliases found by the ANSI algorithm.

Specifying aliasing modes

To specify an aliasing mode, use one of the following options on the C
compiler command line:

• +Optrs_ansi

• +Optrs_strongly_typed

Additional C aliasing options are discussed in "Controlling optimization"
on page 113.

Iteration and stop values 

Aliasing a variable in an array subscript can make it unsafe for the 
compiler to parallelize a loop. Below are several situations that can 
prevent parallelization. 

Using potential aliases as addresses of variables 

In the following example, the code passes &j to getval; getval can use
that address in any number of ways, including possibly assigning it to
iptr. Even though iptr is not passed to getval, getval might still
access it as a global variable or through another alias. This situation
makes j a potential alias for *iptr.

void subex(iptr, n, j)
int *iptr, n, j;
{
  n = getval(&j,n);
  for (j--; j<n; j++)
    iptr[j] += 1;
}

This potential alias means that j and iptr[j] might occupy the same
memory space for some value of j. The assignment to iptr[j] on that
iteration would also change the value of j itself. The possible alteration
of j prevents the compiler from safely parallelizing the loop. In this case,
the Optimization Report says that no induction variable could be found
for the loop, and the compiler does not parallelize the loop. (For
information on Optimization Reports, see "Optimization Report" on
page 151).

Avoid taking the address of any variable that is used as the iteration
variable for a loop. To parallelize the loop in subex, use a temporary
variable i as shown in the following code:

void subex(iptr, n, j) 
int *iptr, n, j; 

{ 

int i; 

  n = getval(&j,n);
  i=j;
  for (i--; i<n; i++)
    iptr[i] += 1;

} 

Using hidden aliases as pointers 

In the next example, ialex takes the address of j and assigns it to ip.
Thus, j becomes an alias for *ip and, potentially, for *iptr. Values
assigned to iptr[j] within the loop could alter the value of j. As a result,
the compiler cannot use j as an induction variable and, without an
induction variable, it cannot count the iterations of the loop. When the
compiler cannot find the loop's iteration count, it cannot
parallelize the loop.

int *ip;
void ialex(iptr)
int *iptr;
{
  int j;
  ip = &j;
  for (j=0; j<2048; j++)
    iptr[j] = 107;
}

To parallelize this loop, remove the line of code that takes the address of 
j or introduce a temporary variable. 

Using a pointer as a loop counter 

Compiling the following function, the compiler finds that *j is not an
induction variable. This is because an assignment to iptr[*j] could
alter the value of *j within the loop. The compiler does not parallelize
the loop.

void ialex2 (iptr, j, n) 
int *iptr; 
int *j, n; 

{ 

  for (*j=0; *j<n; (*j)++)
    iptr[*j] = 107;
}

Again, this problem is solved by introducing a temporary iteration
variable.

Aliasing stop variables 

In the following code, the stop variable n becomes a possible alias for
*iptr when &n is passed to foo. This means that n may be altered during
the execution of the loop. As a result, the compiler cannot count the number
of iterations and cannot parallelize the loop.

void salex(int *iptr, int n)

{ 

int i; 
foo(&n); 

for (i=0; i < n; i++) 
iptr[i] += iptr[i]; 
return; 

} 

To parallelize the affected loop, eliminate the call to foo or move the call
below the loop. In the latter case, flow-sensitive analysis takes care of the
aliasing. You can also create a temporary variable as shown below:

void salex(int *iptr, int n) 

{ 

int i, tmp; 
foo(&n); 
tmp = n; 

for (i=0; i < tmp; i++) 
iptr[i] += iptr[i]; 
return; 

} 

Because tmp is not aliased to iptr, the loop has a fixed stop value and 
the compiler parallelizes it. 

Global variables 

Potential aliases involving global variables cause optimization problems 
in many programs. The compiler cannot tell whether another function 
causes a global variable to become aliased. 

The following code uses a global variable, n, as a stop value. Because n
may have its address taken and assigned to ik outside the scope of the
function, n must be considered a potential alias for *ik. The value of n,
therefore, may be altered on any iteration of the loop. The compiler cannot
determine the stop value and cannot parallelize the loop.

int n, *ik;
void foo(int *ik)

{ 

int i; 

for (i=0; i<n; i++) 
ik[i]=i; 

} 

Using a temporary local variable solves the problem. 

int n; 

void foo(int *ik) 

{ 

int i,stop = n; 

for (i=0; i<stop; ++i)
ik[i]=i; 

} 

If ik is a global array instead of a pointer, the problem does not occur.
Global variables do not cause aliasing problems except when pointers are
involved. The following code is parallelized:

int n, ik[1000];
void foo()
{
  int i;
  for (i=0; i<n; i++)
    ik[i] = i;
}

False cache line sharing

False cache line sharing is a form of cache thrashing. It occurs whenever
two or more threads in a parallel program are assigning different data
items in the same cache line. This section discusses how to avoid false
cache line sharing by restructuring the data layout and controlling the
distribution of loop iterations among threads.

Consider the following Fortran code: 

REAL*4 A (8) 

DO I = 1, 8 
A (I) = ... 


ENDDO 

Assume there are eight threads, each executing one of the above
iterations, and that a(1) is on a processor cache line boundary (32-byte
boundary for V2250 servers) so that all eight elements are in the same
cache line. Only one thread at a time can "own" the cache line, so not only
is the above loop, in effect, run serially, but every assignment by a thread
requires an invalidation of the line in the cache of its previous "owner."
These problems would likely eliminate any benefit of parallelization.

Taking all of the above into consideration, review the code: 

REAL*4 B(100,100) 

DO I = 1, 100 
DO J = 1, 100 

          B(I,J) = ...B(I,J-1)...

ENDDO 

ENDDO 

Assume there are eight threads working on the i loop in parallel. 

The j loop cannot be parallelized because of the dependence. Table 61
shows how the array maps to cache lines, assuming that b(1,1) is on a
cache line boundary. Array entries that fall on cache line boundaries
are noted by hashmarks (#).

Table 61   Initial mapping of array to cache lines

   #1,1#     1,2     #1,3#     1,4     ...    #1,99#     1,100
    2,1      2,2      2,3      2,4     ...     2,99      2,100
    3,1      3,2      3,3      3,4     ...     3,99      3,100
    4,1      4,2      4,3      4,4     ...     4,99      4,100
    5,1     #5,2#     5,3     #5,4#    ...     5,99     #5,100#
    6,1      6,2      6,3      6,4     ...     6,99      6,100
    7,1      7,2      7,3      7,4     ...     7,99      7,100
    8,1      8,2      8,3      8,4     ...     8,99      8,100
   #9,1#     9,2     #9,3#     9,4     ...    #9,99#     9,100
   10,1     10,2     10,3     10,4     ...    10,99     10,100
   11,1     11,2     11,3     11,4     ...    11,99     11,100
   12,1     12,2     12,3     12,4     ...    12,99     12,100
   13,1    #13,2#    13,3    #13,4#    ...    13,99    #13,100#
    ...      ...      ...      ...     ...     ...       ...
  #97,1#    97,2    #97,3#    97,4     ...   #97,99#    97,100
   98,1     98,2     98,3     98,4     ...    98,99     98,100
   99,1     99,2     99,3     99,4     ...    99,99     99,100
  100,1    100,2    100,3    100,4     ...   100,99    100,100

Array entries surrounded by hashmarks (#) are on cache line boundaries.
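
The mapping in Table 61 follows from Fortran's column-major storage. The C sketch below reproduces that arithmetic under the assumptions stated in the text (a 100-row REAL*4 array, 32-byte cache lines holding 8 elements, and b(1,1) on a line boundary); the function names are hypothetical and use the same 1-based subscripts as the Fortran example.

```c
enum { ROWS = 100, ELEMS_PER_LINE = 8 };  /* 32-byte line / 4-byte REAL*4 */

/* Distance of b(i,j) from b(1,1), in elements (column-major order). */
int linear_offset(int i, int j) { return (i - 1) + ROWS * (j - 1); }

/* Index of the cache line that holds b(i,j), counting from 0. */
int cache_line_of(int i, int j)
{
    return linear_offset(i, j) / ELEMS_PER_LINE;
}

/* Nonzero when b(i,j) is the first element of a cache line. */
int on_line_boundary(int i, int j)
{
    return linear_offset(i, j) % ELEMS_PER_LINE == 0;
}
```

Because each 100-element column occupies 12.5 cache lines, the boundaries fall on rows 1, 9, 17, ... in odd-numbered columns and on rows 5, 13, 21, ... in even-numbered columns.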


HP compilers, by default, give each thread about the same number of
iterations, assigning (if necessary) one extra iteration to some threads.
This happens until all iterations are assigned to a thread. Table 62
shows the default distribution of the i loop across 8 threads.

Table 62   Default distribution of the i loop

  Thread ID    Iteration range    Number of iterations
      0             1-12                  12
      1            13-25                  13
      2            26-37                  12
      3            38-50                  13
      4            51-62                  12
      5            63-75                  13
      6            76-87                  12
      7            88-100                 13
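
One simple integer formula that reproduces the ranges in Table 62 is sketched below in C. It is offered only as an illustration that happens to match the table for 100 iterations on 8 threads; it is an assumption, not a statement of how the HP compilers actually compute the distribution, and the names iter_range_t and default_range are hypothetical.

```c
typedef struct { int first, last; } iter_range_t;

/* Contiguous 1-based iteration range for one thread, splitting nits
   iterations as evenly as integer division allows. */
iter_range_t default_range(int thread, int nthreads, int nits)
{
    iter_range_t r;
    r.first = thread * nits / nthreads + 1;
    r.last  = (thread + 1) * nits / nthreads;
    return r;
}
```

For thread 1 of 8 and 100 iterations this gives 13 through 25, so thread 0 ends at row 12 and thread 1 begins at row 13, in the middle of the cache line that holds b(9:16,1).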


This distribution of iterations causes threads to share cache lines. For
example, thread 0 assigns the elements b(9:12,1), and thread 1
assigns elements b(13:16,1) in the same cache line. In fact, every
thread shares cache lines with at least one other thread. Most share
cache lines with two other threads. This type of sharing is called false
because it is a result of the data layout and the compiler's distribution of
iterations. It is not inherent in the algorithm itself. Therefore, it can be
reduced or even removed by:

1 Restructuring the data layout by aligning data on cache line
boundaries

2 Controlling the iteration distribution.

Aligning data to avoid false sharing

Because false cache line sharing is partially due to the layout of the data, 
one step in avoiding it is to adjust the layout. Adjustments are typically 
made by aligning data on cache line boundaries. Aligning arrays 
generally improves performance. However, it can occasionally decrease 
performance. 

The second step in avoiding false cache line sharing is to adjust the 
distribution of loop iterations. This is covered in "Distributing iterations 
on cache line boundaries" on page 283. 

Aligning arrays on cache line boundaries 

Note the assumption that in the previous example, array b starts on a 
cache line boundary. The methods below force arrays in Fortran to start 
on cache line boundaries: 

• Using uninitialized common blocks (blocks with no data statements). 
These blocks start on 64-byte boundaries. 

• Using allocate statements. These statements return addresses on 
64-byte boundaries. This only applies to parallel executables. 

The methods below force arrays in C to start on cache line boundaries:

• Using the functions malloc or memory_class_malloc. These
functions return pointers on 64-byte boundaries.

• Using uninitialized global arrays or structs that are at least 32 bytes.
Such arrays and structs are aligned on 64-byte boundaries.

• Using uninitialized data of the external storage class in C that is at
least 32 bytes. Data is aligned on 64-byte boundaries.
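The alignment guarantees listed above are specific to the HP-UX compilers and libraries. On a platform where malloc makes no such promise, a boundary can be obtained by over-allocating and rounding the pointer up. The sketch below is a generic technique, not an HP API; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper type: keeps the pointer to free next to the aligned one. */
struct aligned_buf {
    void *raw;      /* pointer returned by malloc; pass this one to free() */
    void *aligned;  /* first address in the block that is a multiple of align */
};

/* Reserve nbytes usable bytes that start on an align-byte boundary.
   align must be a power of two, for example 64. */
struct aligned_buf alloc_on_boundary(size_t nbytes, size_t align)
{
    struct aligned_buf buf = { NULL, NULL };
    assert(align != 0 && (align & (align - 1)) == 0);
    buf.raw = malloc(nbytes + align - 1);   /* extra room to slide forward */
    if (buf.raw != NULL) {
        uintptr_t addr = (uintptr_t)buf.raw;
        uintptr_t up = (addr + (align - 1)) & ~(uintptr_t)(align - 1);
        buf.aligned = (char *)buf.raw + (up - addr);
    }
    return buf;
}
```

A caller would use buf.aligned as the array base and release the storage with free(buf.raw).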
Distributing iterations on cache line boundaries

Recall that the default iteration distribution causes thread 0 to work on
iterations 1-12 and thread 1 to work on iterations 13-25, and so on. Even
though the cache lines are aligned across the columns of the array (see
Table 62 on page 281), the iteration distribution still needs to be
changed. Use the chunk_size attribute to change the distribution:

      REAL*4 B(112,100)
      COMMON /ALIGNED/ B
C$DIR PREFER_PARALLEL (CHUNK_SIZE=16)
      DO I = 1, 100
        DO J = 1, 100
          B(I,J) = ...B(I,J-1)...
        ENDDO
      ENDDO

You must specify a constant chunk_size attribute. However, the ideal is
to distribute work so that all but one thread works on the same number
of whole cache lines, and the remaining thread works on any partial
cache line. For example, given the following:

NITS  = number of iterations

NTHDS = number of threads

LSIZE = line size in words (8 for 4-byte data, 4 for 8-byte data, 2 for
16-byte data)

the ideal chunk_size would be (all divisions are integer divisions):

CHUNK_SIZE = LSIZE * (1 + ((1 + (NITS - 1) / LSIZE) - 1) / NTHDS)

For the code above, these numbers are:

NITS  = 100

LSIZE = 8 (aligns on V2250 boundaries for 4-byte data)

NTHDS = 8

CHUNK_SIZE = 8 * (1 + ((1 + (100 - 1) / 8) - 1) / 8)
           = 8 * (1 + ((1 + 12) - 1) / 8)
           = 8 * (1 + 12 / 8)
           = 8 * (1 + 1)
           = 16

chunk_size = 16 causes threads 0, 1, ..., 5 to execute iterations 1-16,
17-32, ..., 81-96, respectively. Thread 6 executes iterations 97-100, and
thread 7 receives no iterations. As a result there is no false cache line
sharing, and parallel performance is greatly improved.
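The chunk-size formula can be wrapped in a small helper so the arithmetic is easy to repeat for other loops. This is a minimal C sketch; the function name is hypothetical, and it relies on truncating integer division exactly as the worked example does.

```c
#include <assert.h>

/* Ideal CHUNK_SIZE per the formula above: whole cache lines per thread.
   nits: iterations, nthds: threads, lsize: cache line size in elements. */
int ideal_chunk_size(int nits, int nthds, int lsize)
{
    int nlines;
    assert(nits > 0 && nthds > 0 && lsize > 0);
    nlines = 1 + (nits - 1) / lsize;            /* cache lines touched, rounded up */
    return lsize * (1 + (nlines - 1) / nthds);  /* lines per thread, rounded up */
}
```

For NITS = 100, NTHDS = 8, and LSIZE = 8, the helper returns 16.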
You cannot specify the ideal chunk_size for every loop. However, using

CHUNK_SIZE = x

where x times the data size (in bytes) is an integral multiple of 32,
eliminates false cache line sharing. This holds only if the following two
conditions are met:

• The arrays are already properly aligned (as discussed earlier in this
section).

• The first iteration accesses the first element of each array being
assigned. For example, in a loop DO I = 2, N, because the loop
starts at I = 2, the first iteration does not access the first element of
the array. Consequently, the iteration distribution does not match the
cache line alignment.

The number 32 is used because the cache line size is 32 bytes for V2250
servers.

Thread-specific array elements 

Sometimes a parallel loop has each thread update a unique element of a
shared array, which is further processed by thread 0 outside the loop.

Consider the following Fortran code in which false sharing occurs:

      REAL*4 S(8)
C$DIR LOOP_PARALLEL
      DO I = 1, N

        S(MY_THREAD()+1) = ...   ! EACH THREAD ASSIGNS ONE ELEMENT OF S

      ENDDO
C$DIR NO_PARALLEL
      DO J = 1, NUM_THREADS()
        = ...S(J)                ! THREAD 0 POST-PROCESSES S
      ENDDO

The problem here is that potentially all the elements of s are in a single
cache line, so the assignments cause false sharing. One approach is to
change the code to force the unique elements into different cache lines, as
indicated in the following code:
      REAL*4 S(8,8)
C$DIR LOOP_PARALLEL
      DO I = 1, N

        S(1,MY_THREAD()+1) = ... ! EACH THREAD ASSIGNS ONE ELEMENT OF S

      ENDDO
C$DIR NO_PARALLEL
      DO J = 1, NUM_THREADS()
        = ...S(1,J)              ! THREAD 0 POST-PROCESSES S
      ENDDO
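The same padding idea can be written in C, where the row index (rather than the Fortran column index) selects the thread. The sketch below is illustrative: the names are hypothetical, it assumes a 4-byte float and a 32-byte line, and it assumes the array itself starts on a cache line boundary.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 8
#define LINE_BYTES  32                            /* V2250 line size, per the text */
#define PAD_ELEMS   (LINE_BYTES / sizeof(float))  /* floats needed to fill one line */

/* Row t belongs to thread t. Only s[t][0] is used; the rest of the row is padding. */
static float s[MAX_THREADS][PAD_ELEMS];

/* Byte offset of thread t's slot from the start of s. */
size_t slot_offset(int t)
{
    assert(t >= 0 && t < MAX_THREADS);
    return (size_t)((char *)&s[t][0] - (char *)&s[0][0]);
}

/* Cache line index, relative to s, that holds thread t's slot. */
size_t slot_line(int t)
{
    return slot_offset(t) / LINE_BYTES;
}
```

Because consecutive slots are one full line apart, no two threads write to the same line. The cost is that seven of every eight elements are unused.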


Scalars sharing a cache line 

Sometimes parallel tasks assign unique scalar variables that are in the
same cache line, as in the following code:

      COMMON /RESULTS/ SUM, PRODUCT
C$DIR BEGIN_TASKS
      DO I = 1, N

        SUM = SUM + ...

      ENDDO
C$DIR NEXT_TASK
      DO J = 1, M

        PRODUCT = PRODUCT * ...

      ENDDO
C$DIR END_TASKS
Working with unaligned arrays

The most common cache-thrashing complication using arrays and loops
occurs when arrays assigned within a loop are unaligned with each other.
There are several possible causes for this:

• Arrays that are local to a routine are allocated on the stack.

• Array dummy arguments might be passed an element other than the
first in the actual argument.

• Array elements might be assigned with different offset indexes.

Consider the following Fortran code:

      COMMON /OKAY/ X(112,100)

      CALL UNALIGNED (X(I,J))

      SUBROUTINE UNALIGNED (Y)
      REAL*4 Y(*)
      ! Y(1) PROBABLY NOT ON A CACHE LINE BOUNDARY

The address of y(1) is unknown. However, if elements of y are heavily
assigned in this routine, it may be worthwhile to compute an alignment,
given by the following formula:

LREM = LSIZE - ((MOD(LOC(Y(1))-4, LSIZE*x) + 4) / x)

where

lsize is the appropriate cache line size in words

x is the data size for elements of y

For this case, the 32-byte cache line on V2250 servers holds 8
single-precision words, so lsize is 8. Note that:

((MOD(LOC(Y(1))-4, LSIZE*4) + 4) / 4)

returns a value in the set 1, 2, 3, ..., lsize, so lrem is in the range 0 to 7.

Then a loop such as:

      DO I = 1, N
        Y(I) = ...
      ENDDO
is transformed to:

C$DIR NO_PARALLEL
      DO I = 1, MIN (LREM, N)     ! 0 <= LREM < 8
        Y(I) = ...
      ENDDO
C$DIR PREFER_PARALLEL (CHUNK_SIZE = 16)
      DO I = LREM+1, N
        ! Y(LREM+1) IS ON A CACHE LINE BOUNDARY
        Y(I) = ...
      ENDDO

The first loop takes care of elements from the first (if any) partial cache
line of data. The second loop begins on a cache line boundary, and is
controlled with chunk_size to avoid false sharing among the threads.
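The LREM computation is easy to get wrong by one element, so it can help to try it on known addresses. The following C sketch mirrors the formula under one assumption that the guide does not state explicitly: the literal 4 in the formula is taken to be the element size x. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of elements of y that precede the next cache line boundary (LREM).
   addr: byte address of y(1); lsize: line size in elements;
   elem_bytes: element size in bytes (assumed to be the "4" in the formula). */
unsigned lrem_elements(uintptr_t addr, unsigned lsize, unsigned elem_bytes)
{
    uintptr_t line_bytes = (uintptr_t)lsize * elem_bytes;
    uintptr_t pos;   /* value in 1..lsize; equals lsize when y(1) is already aligned */
    assert(lsize > 0 && elem_bytes > 0 && addr % elem_bytes == 0);
    /* Adding line_bytes before subtracting keeps the unsigned arithmetic from wrapping. */
    pos = ((addr + line_bytes - elem_bytes) % line_bytes + elem_bytes) / elem_bytes;
    return lsize - (unsigned)pos;
}
```

With 4-byte data and an 8-word line, an address that is a multiple of 32 gives 0, and an address 4 bytes past a boundary gives 7.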


Working with dependences 

Data dependences in loops may prevent parallelization and prevent the
elimination of false cache line sharing. If certain conditions are met,
some performance gains can still be achieved.

For example, consider the following code: 

      COMMON /ALIGNED/ P(128,128), Q(128,128), R(128,128)
      REAL*4 P, Q, R
      DO J = 2, 128
        DO I = 2, 127
          P(I-1,J) = SQRT (P(I-1,J-1) + 1./3.)
          Q(I  ,J) = SQRT (Q(I  ,J-1) + 1./3.)
          R(I+1,J) = SQRT (R(I+1,J-1) + 1./3.)
        ENDDO
      ENDDO

Only the i loop is parallelized, due to the loop-carried dependences in the
j loop. It is impossible to distribute the iterations so that there is no false
cache line sharing in the above loop. If all loops that refer to these arrays
always use the same offsets (which is unlikely) then you could make
dimension adjustments that would allow a better iteration distribution.

For example, the following would work well for 8 threads:

      COMMON /ADJUSTED/ P(128,128), PAD1(15), Q(128,128),
     >                  PAD2(15), R(128,128)
      DO J = 2, 128
C$DIR PREFER_PARALLEL (CHUNK_SIZE=16)
        DO I = 2, 127
          P(I-1,J) = SQRT (P(I-1,J-1) + 1./3.)
          Q(I  ,J) = SQRT (Q(I  ,J-1) + 1./3.)
          R(I+1,J) = SQRT (R(I+1,J-1) + 1./3.)
        ENDDO
      ENDDO
Padding 60 bytes before the declarations of both q and r causes
p(1,j), q(2,j), and r(3,j) to be aligned on 64-byte boundaries for all
j. Combined with a chunk_size of 16, this causes threads to assign
data to unique whole cache lines.
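The effect of the two 15-word pads can be confirmed by computing byte offsets within the common block. This C sketch is illustrative only: the function names are hypothetical, arrays are taken to be column-major as in Fortran, and the block is assumed to begin on a 64-byte boundary.

```c
#include <assert.h>

#define DIM        128   /* P, Q and R are DIM x DIM REAL*4 arrays */
#define ELEM_BYTES 4
#define PAD_ELEMS  15    /* PAD1 and PAD2: 15 words = 60 bytes each */

/* Byte offset of column-major element (i,j), 1-based, in an array that starts at base. */
static long elem_offset(long base, int i, int j)
{
    assert(i >= 1 && i <= DIM && j >= 1 && j <= DIM);
    return base + ((long)(j - 1) * DIM + (i - 1)) * ELEM_BYTES;
}

/* Offsets measured from the start of the common block. */
long p_offset(int i, int j)
{
    return elem_offset(0L, i, j);
}

long q_offset(int i, int j)
{
    long q_base = (long)DIM * DIM * ELEM_BYTES + PAD_ELEMS * ELEM_BYTES;
    return elem_offset(q_base, i, j);
}

long r_offset(int i, int j)
{
    long r_base = 2L * ((long)DIM * DIM * ELEM_BYTES + PAD_ELEMS * ELEM_BYTES);
    return elem_offset(r_base, i, j);
}
```

Under these assumptions, p_offset(1, j), q_offset(2, j), and r_offset(3, j) are all multiples of 64 for every j.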

You can usually find a mix of all the above problems in some
CPU-intensive loops. You cannot avoid all false cache line sharing, but by
careful inspection of the problems and careful application of some of the
workarounds shown here, you can significantly enhance the performance
of your parallel loops.
Floating-point imprecision

The compiler applies normal arithmetic rules to real numbers. It
assumes that two arithmetically equivalent expressions produce the
same numerical result.

Most real numbers cannot be represented exactly in digital computers.
Instead, these numbers are rounded to a floating-point value that can be
represented. When optimization changes the evaluation order of a
floating-point expression, the results can change. Possible consequences
of floating-point roundoff include program aborts, division by zero,
address errors, and incorrect results.

In any parallel program, the execution order of the instructions differs
from the serial version of the same program. This can cause noticeable
roundoff differences between the two versions. Running a parallel code
under different machine configurations or conditions can also yield
roundoff differences, because the execution order can differ under
differing machine conditions, causing roundoff errors to propagate in
different orders between executions. Accumulator variables (reductions)
are especially susceptible to these problems.

Consider the following Fortran example: 

C$DIR GATE(ACCUM_LOCK)
      LK = ALLOC_GATE(ACCUM_LOCK)

      LK = UNLOCK_GATE(ACCUM_LOCK)
C$DIR BEGIN_TASKS, TASK_PRIVATE(I)
      CALL COMPUTE(A)
C$DIR CRITICAL_SECTION(ACCUM_LOCK)
      ACCUM = ACCUM + A
C$DIR END_CRITICAL_SECTION
C$DIR NEXT_TASK
      DO I = 1, 10000
        B(I) = FUNC(I)
C$DIR CRITICAL_SECTION(ACCUM_LOCK)
        ACCUM = ACCUM + B(I)
C$DIR END_CRITICAL_SECTION
      ENDDO
C$DIR NEXT_TASK
      DO I = 1, 10000
        X = X + C(I) + D(I)
      ENDDO
C$DIR CRITICAL_SECTION(ACCUM_LOCK)
      ACCUM = ACCUM/X
C$DIR END_CRITICAL_SECTION
C$DIR END_TASKS

Here, three parallel tasks are all manipulating the real variable accum, 
using real variables which have themselves been manipulated. Each 
manipulation is subject to roundoff error, so the total roundoff error here 
might be substantial. 

When the program runs in serial, the tasks execute in their written 
order, and the roundoff errors accumulate in that order. However, if the 
tasks run in parallel, there is no guarantee as to what order the tasks 
run in. This means that the roundoff error accumulates in a different 
order than it does during the serial run. 

Depending on machine conditions, the tasks may also run in different
orders during different parallel runs, potentially accumulating roundoff
errors differently and yielding different answers.
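The order dependence is easy to reproduce in isolation. The C sketch below is a serial illustration, not a parallel program: it adds the same three single-precision values in two different orders. The function names are hypothetical, and the volatile accumulator is there to force each partial sum to be rounded to float.

```c
#include <assert.h>

/* Sum v[0..n-1] front to back, rounding to float after every addition. */
float sum_forward(const float *v, int n)
{
    volatile float acc = 0.0f;   /* volatile: store (and round) after each step */
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        acc = acc + v[i];
    return acc;
}

/* The same terms, added back to front. */
float sum_backward(const float *v, int n)
{
    volatile float acc = 0.0f;
    int i;
    assert(n >= 0);
    for (i = n - 1; i >= 0; i--)
        acc = acc + v[i];
    return acc;
}
```

For the values 1.0e8, -1.0e8, and 1.0, the forward sum is 1.0 and the backward sum is 0.0, because 1.0 is lost when it is added to -1.0e8 in single precision.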

Problems with floating-point precision can also occur when a program
tests the value of a variable without allowing enough tolerance for
roundoff errors. To solve the problem, adjust the tolerances to allow for
greater roundoff errors or declare the variables to be of a higher precision
(use the double type instead of float in C and C++, or real*8 rather
than real*4 in Fortran). Testing floating-point numbers for exact
equality is strongly discouraged.
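A tolerance test can be packaged as a small function. The following C sketch shows one common form, combining a relative and an absolute tolerance. The function names and the choice of tolerances are illustrative assumptions, not something the HP compilers or libraries provide.

```c
#include <assert.h>

/* Absolute value without pulling in the math library. */
static double abs_d(double x)
{
    return x < 0.0 ? -x : x;
}

/* Return 1 if a and b differ by no more than abs_tol, or by no more than
   rel_tol times the larger of their magnitudes; otherwise return 0. */
int nearly_equal(double a, double b, double rel_tol, double abs_tol)
{
    double diff = abs_d(a - b);
    double scale = abs_d(a) > abs_d(b) ? abs_d(a) : abs_d(b);
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    return diff <= abs_tol || diff <= rel_tol * scale;
}
```

A test such as nearly_equal(x, y, 1.0e-12, 0.0) then replaces x == y. Suitable tolerances depend on how much roundoff the computation can accumulate.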

Enabling sudden underflow 

By default, PA-RISC processor hardware represents a floating-point
number in denormalized format when the number is tiny. A floating-point
number is considered tiny if its exponent field is zero but its
mantissa is nonzero. This practice is extremely costly in terms of
execution time and seldom provides any benefit.

You can enable sudden underflow (flush to zero) of denormalized values
by passing the +FPD flag to the linker. This is done using the -W compiler
option.

For more information, refer to the HP-UX Floating-Point Guide.

The following example shows an f90 command line issuing this
command:

% f90 -Wl,+FPD prog.f

This command line compiles the program prog.f and instructs the
linker to enable sudden underflow.


Invalid subscripts 

An array reference in which any subscript falls outside declared bounds
for that dimension is called an invalid subscript. Invalid subscripts are a
common cause of answers that vary between optimization levels and
programs that abort and result in a core dump.

Use the command-line option -C (check subscripts) with f90 to check
that each subscript is within its array bounds. See the f90(1) man page
for more information. The C and aC++ compilers do not have an option
corresponding to the Fortran compiler's -C option.
Misused directives and pragmas

Misused directives and pragmas are a common cause of wrong answers.
Some of the more common misuses of directives and pragmas involve the
following:

• Loop-carried dependences

• Reductions

• Nondeterminism of parallel execution

The sections below describe these items and methods for avoiding them.

Loop-carried dependences 

Forcing parallelization of a loop containing a call is safe only if the called 
routine contains no dependences. 

Do not assume that it is always safe to parallelize a loop whose data is
safe to localize. You can safely localize loop data in loops that do not
contain a loop-carried dependence (LCD) of the form shown in the
following Fortran loop:

      DO I = 2, M
        DO J = 1, N
          A(I,J) = A(I+IADD,J+JADD) + B(I,J)
        ENDDO
      ENDDO

where one of iadd and jadd is negative and the other is positive. This is
explained in detail in the section "Conditions that inhibit data
localization" on page 59.

You cannot safely parallelize a loop that contains any kind of LCD,
except by using ordered sections around the LCDs as described in the
section "Ordered sections" on page 255. Also see the section "Inhibiting
parallelization" on page 105.
The main section of the Fortran program below initializes a, calls calc,
and outputs the new array values. In subroutine calc, the indirect index
used in a(in(i)) introduces a potential dependence that prevents the
compiler from parallelizing calc's i loop.

      PROGRAM MAIN
      REAL A(1025)
      INTEGER IN(1025)
      COMMON /DATA/ A
      DO I = 1, 1025
        IN(I) = I
      ENDDO
      CALL CALC(IN)
      CALL OUTPUT(A)
      END

      SUBROUTINE CALC(IN)
      INTEGER IN(1025)
      REAL A(1025)
      COMMON /DATA/ A
      DO I = 1, 1025
        A(I) = A(IN(I))
      ENDDO
      RETURN
      END

Because you know that in(i) = i, you can use the
no_loop_dependence directive, as shown below. This directive allows
the compiler to ignore the apparent dependence and parallelize the loop
when compiling with +O3 +Oparallel.

      SUBROUTINE CALC(IN)
      INTEGER IN(1025)
      REAL A(1025)
      COMMON /DATA/ A
C$DIR NO_LOOP_DEPENDENCE(A)
      DO I = 1, 1025
        A(I) = A(IN(I))
      ENDDO
      RETURN
      END
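The directive is only as safe as the assumption behind it. Where the index array is built elsewhere, a debug-time check can confirm that assumption before it is relied on. The C sketch below is one possible check; the function name is hypothetical and nothing like it is part of the HP runtime.

```c
#include <assert.h>

/* Return 1 if in[k] == k + 1 for every k, which is the Fortran condition
   IN(I) = I expressed with C's 0-based arrays; otherwise return 0. */
int index_is_identity(const int *in, int n)
{
    int k;
    assert(in != 0 && n >= 0);
    for (k = 0; k < n; k++) {
        if (in[k] != k + 1)    /* Fortran indices are 1-based */
            return 0;
    }
    return 1;
}
```

If the check fails, the apparent dependence may be real, and the loop should be left for the compiler to analyze without the directive.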
Reductions

Reductions are a special class of dependence that the compiler can
parallelize. An apparent LCD can prevent the compiler from
parallelizing a loop containing a reduction.

The loop in the following Fortran example is not parallelized because of
an apparent dependence between the references to a(i) on line 6 and
the assignment to a(ja(j)) on line 7. The compiler does not realize that
the values of the elements of ja never coincide with the values of i.
Assuming that they might collide, the compiler conservatively avoids
parallelizing the loop.

      DO I = 1, 100
        JA(I) = I + 10
      ENDDO
      DO I = 1, 100
        DO J = I, 100
          A(I) = A(I) + B(J) * C(J)    ! LINE 6
          A(JA(J)) = B(J) + C(J)       ! LINE 7
        ENDDO
      ENDDO

In this example, as well as the examples that follow, the apparent
dependence becomes real if any of the values of the elements of ja are
equal to the values iterated over by i.

A no_loop_dependence directive or pragma placed before the j loop
tells the compiler that the indirect subscript does not cause a true
dependence. Because reductions are a form of dependence, this directive
also tells the compiler to ignore the reduction on a(i), which it would
normally handle. Ignoring this reduction causes the compiler to generate
incorrect code for the assignment on line 6. The apparent dependence on
line 7 is properly handled because of the directive. The resulting code
runs fast but produces incorrect answers.
To solve this problem, distribute the j loop, isolating the reduction from
the other statements, as shown in the following Fortran example:

      DO I = 1, 100
        DO J = I, 100
          A(I) = A(I) + B(J) * C(J)
        ENDDO
      ENDDO
C$DIR NO_LOOP_DEPENDENCE(A)
      DO I = 1, 100
        DO J = I, 100
          A(JA(J)) = B(J) + C(J)
        ENDDO
      ENDDO

The apparent dependence is removed, and both loops are optimized.


Nondeterminism of parallel execution 

In a parallel program, threads do not execute in a predictable or
determined order. If you force the compiler to parallelize a loop when a
dependence exists, the results are unpredictable and can vary from one
execution to the next.

Consider the following Fortran code:

      DO I = 1, N-1
        A(I) = A(I+1) * B(I)
      ENDDO

The compiler does not parallelize this code as written because of the
dependence on a(i). This dependence requires that the original value of
a(i+1) be available for the computation of a(i).

If this code were parallelized, some values of a would be assigned by some
processors before they were used by others, resulting in incorrect
assignments.

Because the results depend on the order in which statements execute,
the errors are nondeterministic. The loop must therefore execute in
iteration order to ensure that all values of a are computed correctly.
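The effect of executing these iterations out of order can be shown without any threads. The C sketch below runs the same updates in iteration order and then in reverse order, which stands in for one of the many interleavings a forced parallelization might produce. The function names are hypothetical.

```c
#include <assert.h>

/* Serial semantics: iteration i reads the original a[i+1] before it is overwritten. */
void update_in_order(double *a, const double *b, int n)
{
    int i;
    assert(n >= 1);
    for (i = 0; i < n - 1; i++)
        a[i] = a[i + 1] * b[i];
}

/* The same iterations executed last to first: each one now reads an
   a[i+1] that has already been overwritten. */
void update_reversed(double *a, const double *b, int n)
{
    int i;
    assert(n >= 1);
    for (i = n - 2; i >= 0; i--)
        a[i] = a[i + 1] * b[i];
}
```

Starting from a = {1, 2, 3, 4} and b = {2, 2, 2}, the in-order version produces {4, 6, 8, 4}, while the reversed version produces {32, 16, 8, 4}.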

Loops containing dependences can sometimes be manually parallelized 
using the loop_parallel (ordered) directive as described in "Parallel 
synchronization" on page 243. Unless you are sure that no loop-carried 
dependence exists, it is safest to let the compiler choose which loops to 
parallelize. 
Triangular loops

A triangular loop is a loop nest with an inner loop whose upper or lower
bound (but not both) is a function of the outer loop's index. Examples of a
lower triangular loop and an upper triangular loop are given below. To
simplify explanations, only Fortran examples are provided in this
section.

Lower triangular loop

      DO J = 1, N
        DO I = J+1, N
          F(I) = F(I) + ... + X(I,J) + ...

[Figure: elements referenced in array X (shaded cells), by column J]

Upper triangular loop

      DO J = 1, N
        DO I = 1, J-1
          F(I) = F(I) + ... + X(I,J) + ...

[Figure: elements referenced in array X (shaded cells), by column J]


While the compiler can usually auto-parallelize one of the outer or inner
loops, there are typically performance problems in either case:

• If the outer loop is parallelized by assigning contiguous chunks of
iterations to each of the threads, the load is severely unbalanced. For
example, in the lower triangular example above, the thread doing the
last chunk of iterations does far less work than the thread doing the
first chunk.

• If the inner loop is auto-parallelized, then on each outer iteration in
the j loop, the threads are assigned to work on a different set of
iterations in the i loop, thus losing access to some of their previously
encached elements of f and thrashing each other's caches in the
process.

By manually controlling the parallelization, you can greatly improve the
performance of a triangular loop. Parallelizing the outer loop is generally
more beneficial than parallelizing the inner loop. The next two sections
explain how to achieve the enhanced performance.
Parallelizing the outer loop

Certain directives allow you to control the parallelization of the outer
loop in a triangular loop to optimize the performance of the loop nest.

For the outer loop, assign iterations to threads in a balanced manner.
The simplest method is to assign iterations to the threads one at a time
using the chunk_size attribute:

C$DIR PREFER_PARALLEL (CHUNK_SIZE = 1)
      DO J = 1, N
        DO I = J+1, N
          Y(I,J) = Y(I,J) + ...X(I,J)...

This causes each thread to execute in the following manner:

      DO J = MY_THREAD() + 1, N, NUM_THREADS()
        DO I = J+1, N
          Y(I,J) = Y(I,J) + ...X(I,J)...

where 0 <= my_thread() < num_threads()

In this case, the first thread still does more work than the last, but the
imbalance is greatly reduced. For example, assume n = 128 and there
are 8 threads. Then the default parallel compilation would cause thread
0 to do j = 1 to 16, resulting in 1912 inner iterations, whereas thread 7
does j = 113 to 128, resulting in 120 inner iterations. With
chunk_size = 1, thread 0 does 1072 inner iterations, and thread 7 does
960.
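The iteration counts for the two schedules can be recomputed by summing the inner trip count, N - J, over the outer iterations each thread receives. The C sketch below is a simple model of the two schedules; the function names are hypothetical, and the block schedule assumes the thread count divides N evenly.

```c
#include <assert.h>

/* Inner-loop trips for one outer iteration j of DO I = J+1, N. */
static long inner_trips(int j, int n)
{
    return (long)(n - j);
}

/* Work for thread t when the outer loop is split into contiguous blocks. */
long block_work(int t, int n, int nthreads)
{
    int per, j;
    long total = 0;
    assert(nthreads > 0 && n % nthreads == 0 && t >= 0 && t < nthreads);
    per = n / nthreads;
    for (j = t * per + 1; j <= (t + 1) * per; j++)
        total += inner_trips(j, n);
    return total;
}

/* Work for thread t with CHUNK_SIZE = 1: j = t+1, t+1+nthreads, ... */
long cyclic_work(int t, int n, int nthreads)
{
    int j;
    long total = 0;
    assert(nthreads > 0 && t >= 0 && t < nthreads);
    for (j = t + 1; j <= n; j += nthreads)
        total += inner_trips(j, n);
    return total;
}
```

For N = 128 and 8 threads, this model gives 1912 and 120 inner iterations for threads 0 and 7 under the block schedule, and 1072 and 960 under the cyclic schedule.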

Parallelizing the inner loop 

If the outer loop cannot be parallelized, it is recommended that you
parallelize the inner loop if possible. There are two issues to be aware of
when parallelizing the inner loop:

• Cache thrashing

Consider the parallelization of the following inner loop:

      DO J = I+1, N
        F(J) = F(J) + SQRT(A(J)**2 - B(I)**2)

where i varies in the outer loop iteration.
The default iteration distribution has each thread processing a
contiguous chunk of iterations of approximately the same number as
every other thread. The amount of work per thread is about the same;
however, from one outer iteration to the next, threads work on
different elements in f, resulting in cache thrashing.

• The overhead of parallelization

If the loop cannot be interchanged to be outermost (or at least
outermore), then the overhead of parallelization is compounded by
the number of outer loop iterations.

The scheme below assigns "ownership" of elements to threads on a cache
line basis so that threads always work on the same cache lines and retain
data locality from one iteration to the next. In addition, the parallel
directive is used to spawn threads just once. The outer, nonparallel loop
is replicated on all processors, and the inner loop iterations are manually
distributed to the threads.

C     F IS KNOWN TO BEGIN ON A CACHE LINE BOUNDARY
      NTHD = NUM_THREADS()
      CHUNK = 8               ! CHUNK * DATA SIZE (4 BYTES)
                              ! EQUALS PROCESSOR CACHE LINE SIZE;
                              ! A SINGLE THREAD WORKS ON CHUNK = 8
                              ! ITERATIONS AT A TIME
      NTCHUNK = NTHD * CHUNK  ! A CHUNK TO BE SPLIT AMONG THE THREADS
C$DIR PARALLEL,PARALLEL_PRIVATE(ID,JS,JJ,J,I)
      ID = MY_THREAD() + 1    ! UNIQUE THREAD ID
      DO I = 1, N
        JS = ((I+1 + NTCHUNK-1 - ID*CHUNK) / NTCHUNK) * NTCHUNK
     >       + (ID-1) * CHUNK + 1
        DO JJ = JS, N, NTCHUNK
          DO J = MAX (JJ, I+1), MIN (N, JJ+CHUNK-1)
            F(J) = F(J) + SQRT(A(J)**2 - B(I)**2)
          ENDDO
        ENDDO
      ENDDO
C$DIR END_PARALLEL


The idea is to assign a fixed ownership of cache lines of f and to assign a 
distribution of those cache lines to threads that keeps as many threads 
busy computing whole cache lines for as long as possible. Using 
chunk = 8 for 4-byte data makes each thread work on 8 iterations 
covering a total of 32 bytes—the processor cache line size for V2250 
servers. 




In general, set chunk equal to the smallest value that multiplies by the
data size to give a multiple of 32 (the processor cache line size on V2250
servers). Smaller values of chunk keep most threads busy most of the
time.
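The smallest suitable value follows from the greatest common divisor of the element size and the line size. The C sketch below shows the calculation; the function names are hypothetical, and the 32-byte line size is passed in as a parameter rather than assumed.

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm. */
static unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Smallest chunk such that chunk * elem_bytes is a multiple of line_bytes. */
unsigned smallest_chunk(unsigned elem_bytes, unsigned line_bytes)
{
    assert(elem_bytes > 0 && line_bytes > 0);
    return line_bytes / gcd(elem_bytes, line_bytes);
}
```

With a 32-byte line this gives 8 for 4-byte data, 4 for 8-byte data, and 2 for 16-byte data.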

Because of the ever-decreasing work in the triangular loop, there are
eventually fewer cache lines left to compute than there are threads.
Consequently, threads drop out until there is only one thread left to
compute those iterations associated with the last cache line. Compare
this distribution to the default distribution that causes false cache line
sharing and consequent thrashing when all threads attempt to compute
data into a few cache lines. See "False cache line sharing" on page 279 in
this chapter.

The scheme above maps a sequence of NTCHUNK-sized blocks over the f
array. Within each block, each thread owns a specific cache line of data.
The relationship between data, threads, and blocks of size NTCHUNK is
shown in Figure 19 on page 301.
Figure 19    Data ownership by chunk and ntchunk blocks

NTCHUNK 1

CHUNKs of F            Associated thread
F(1)  ... F(8)         thread 0
F(9)  ... F(16)        thread 1
F(17) ... F(24)        thread 2
F(25) ... F(32)        thread 3
F(33) ... F(40)        thread 4
F(41) ... F(48)        thread 5
F(49) ... F(56)        thread 6
F(57) ... F(64)        thread 7

NTCHUNK 2

CHUNKs of F            Associated thread
F(65) ... F(72)        thread 0
F(73) ... F(80)        thread 1
F(81) ...


chunk is the number of iterations a thread works on at one time. The
idea is to make a thread work on the same elements of f from one
iteration of i to the next (except for those that are already complete).

The scheme above causes thread 0 to do all work associated with the
cache lines starting at f(1), f(1+ntchunk), f(1+2*ntchunk), and so
on. Likewise, thread 1 does the work associated with the cache lines
starting at f(9), f(9+ntchunk), f(9+2*ntchunk), and so on.

If a thread assigns certain elements of f for i = 2, then it is certain that
the same thread encached those elements of f in iteration i = 1. This
eliminates cache thrashing among the threads.


Examining the code 

Having established the idea of assigning cache line ownership, consider
the following Fortran code in more detail:

C$DIR PARALLEL,PARALLEL_PRIVATE(ID,JS,JJ,J,I)
      ID = MY_THREAD() + 1    ! UNIQUE THREAD ID
      DO I = 1, N
        JS = ((I+1 + NTCHUNK-1 - ID*CHUNK) / NTCHUNK) * NTCHUNK
     >       + (ID-1) * CHUNK + 1
        DO JJ = JS, N, NTCHUNK
          DO J = MAX (JJ, I+1), MIN (N, JJ+CHUNK-1)
            F(J) = F(J) + SQRT(A(J)**2 - B(I)**2)
          ENDDO
        ENDDO
      ENDDO
C$DIR END_PARALLEL



C$DIR PARALLEL, PARALLEL_PRIVATE(ID,JS,JJ,J,I)

Spawns threads, each of which begins executing the
statements in the parallel region. Each thread has a
private version of the variables id, js, jj, j, and i.

ID = MY_THREAD() + 1 ! UNIQUE THREAD ID

Establishes a unique id for each thread, in the
range 1 to num_threads().

DO I = 1, N

Executes the i loop redundantly on all threads (instead
of thread 0 executing it alone).

JS = ((I+1 + NTCHUNK-1 - ID*CHUNK) / NTCHUNK) * NTCHUNK
     + (ID-1) * CHUNK + 1

Determines, for a given value of i+1, which ntchunk
the value i+1 falls in. Then it assigns a unique
chunk of it to each thread id. Suppose that there are
ntc ntchunks, where ntc is approximately n/ntchunk.
Then the expression:

((I+1 + NTCHUNK-1 - ID*CHUNK) / NTCHUNK)

returns a value in the range 0 to ntc for a given value of
i+1. Then the expression:

((I+1 + NTCHUNK-1 - ID*CHUNK) / NTCHUNK) * NTCHUNK

identifies the start of an ntchunk that contains i+1 or
is immediately above i+1 for a given value of id.

For the ntchunk that contains i+1, if the cache lines
owned by a thread either contain i+1 or are above i+1
in memory, this expression returns this ntchunk. If the
cache lines owned by a thread are below i+1 in this
ntchunk, this expression returns the next highest
ntchunk. In other words, if there is no work for a
particular thread to do in this ntchunk, then start
working in the next one.

(ID-1) * CHUNK + 1

identifies the start of the particular cache line for the
thread to compute within this ntchunk.

DO JJ = JS, N, NTCHUNK

runs a unique set of cache lines starting at its specific
js and continuing into succeeding ntchunks until all
the work is done.

DO J = MAX (JJ, I+1), MIN (N, JJ+CHUNK-1)

performs the work within a single cache line. If the
starting index (i+1) is greater than the first element in
the cache line (jj) then start with i+1. If the ending
index (n) is less than the last element in the cache line,
then finish with n.

The following are observations of the preceding loops:

• Most of the "complicated" arithmetic is in the outer loop iterations.

• You can replace divides with shift instructions because they involve
powers of two.

• If this application were to be run on a V2250 single-node machine, it
would be appropriate to choose a chunk size of 8 for 4-byte data.
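The index arithmetic walked through above can be exercised outside the parallel region. The C sketch below replays the js, jj, and j loops for every thread id in turn and counts how often each element of f would be updated. It is a serial model with hypothetical function names and assumes chunk = 8.

```c
#include <assert.h>

#define CHUNK 8   /* elements per cache line for 4-byte data */

/* First jj visited by thread id (1-based) in outer iteration i. Mirrors the JS statement. */
int js_start(int i, int id, int nthd)
{
    int ntchunk = nthd * CHUNK;
    assert(i >= 1 && id >= 1 && id <= nthd);
    return ((i + 1 + ntchunk - 1 - id * CHUNK) / ntchunk) * ntchunk
           + (id - 1) * CHUNK + 1;
}

/* Count the updates of element j in outer iteration i across all threads.
   The id of the last thread that touched j is stored through owner. */
int updates_of(int j, int i, int n, int nthd, int *owner)
{
    int id, jj, jx, hits = 0;
    int ntchunk = nthd * CHUNK;
    for (id = 1; id <= nthd; id++) {
        for (jj = js_start(i, id, nthd); jj <= n; jj += ntchunk) {
            int lo = jj > i + 1 ? jj : i + 1;                  /* MAX(JJ, I+1) */
            int hi = n < jj + CHUNK - 1 ? n : jj + CHUNK - 1;  /* MIN(N, JJ+CHUNK-1) */
            for (jx = lo; jx <= hi; jx++) {
                if (jx == j) {
                    hits++;
                    *owner = id;
                }
            }
        }
    }
    return hits;
}
```

In this model every element j > i is updated exactly once per outer iteration, always by thread id ((j - 1) / CHUNK) mod nthd + 1, and elements j <= i are never touched.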
Compiler assumptions

Compiler assumptions can produce faulty optimized code when the
source code contains:

• Increments by zero

• Trip counts that may overflow at optimization levels +O2 and above

The following sections describe these items and methods for avoiding
them.

Incrementing by zero

The compiler assumes that whenever a variable is being incremented on
each iteration of a loop, the variable is being incremented by a
loop-invariant amount other than zero. If the compiler parallelizes a loop that
increments a variable by zero on each trip, the loop can produce incorrect
answers or cause the program to abort. This error can occur when a
variable used as an incrementation value is accidentally set to zero. If
the compiler detects that the variable has been set to zero, the compiler
does not parallelize the loop. If the compiler cannot detect the
assignment, however, the symptoms described below occur.

The following Fortran code shows two loops that increment by zero: 

      CALL SUB1(0)

      SUBROUTINE SUB1(IZR)
      DIMENSION A(100), B(100), C(100)
      J = 1
      DO I = 1, 100, IZR     ! INCREMENT VALUE OF 0 IS
                             ! NON-STANDARD
        A(I) = B(I)
      ENDDO
      PRINT *, A(11)
      DO I = 1, 100
        J = J + IZR
        B(I) = A(J)
        A(J) = C(I)
      ENDDO
      PRINT *, A(1)
      PRINT *, B(11)
      END
Because izr is an argument passed to sub1, the compiler does not detect
that izr has been set to zero. Both loops parallelize at

+O3 +Oparallel +Onodynsel.

The loops compile at +O3, but the first loop, which specifies the step as
part of the do statement (or as part of the for statement in C), attempts
to parcel out loop iterations by a step of izr. At runtime, this loop is
infinite.

Due to dependences, the second loop would not behave predictably when
parallelized, if it were ever reached at runtime. The compiler does not
detect the dependences because it assumes j is an induction variable.


Trip counts that may overflow 

Some loop optimizations at +O2 and above may cause the variable on
which the trip count is based to overflow. A loop's trip count is the
number of times the loop executes. The compiler assumes that each
induction variable is increasing (or decreasing) without overflow during
the loop. Any overflowing induction variable may be used by the compiler
as a basis for the trip count. The following sections discuss when this
overflow may occur and how to avoid it.

Linear test replacement

When optimizing loops, the compiler often disregards the original
induction variable, using instead a variable or value that better indicates
the actual stride of the loop. A loop's stride is the value by which the
iteration variable increases on each iteration. By picking the largest
possible stride, the compiler reduces the execution time of the loop by
reducing the number of arithmetic operations within each iteration.

The Fortran code below contains an example of a loop in which the
induction variable may be replaced by the compiler:

      ICONST = 64
      ITOT = 0
      DO IND = 1,N
        IPACK = (IND*1024)*ICONST**2
        IF(IPACK .LE. (N/2)*1024*ICONST**2)
     >    ITOT = ITOT + IPACK
      ENDDO
      END




Executing this loop using ind as the induction variable with a stride of 1
would be extremely inefficient. Therefore, the compiler picks ipack as
the induction variable and uses the amount by which it increases on each
iteration, 1024*64^2 or 2^22, as the stride.

The trip count (n in the example), or just trip, is the number of times the
loop executes, and the start value is the initial value of the induction
variable.
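
The effect of linear test replacement on the loop above can be sketched
in C as follows. This is a hand-written illustration, not compiler
output. The function names itot_original and itot_replaced are
hypothetical, and 64-bit arithmetic is used so that the sketch itself
cannot overflow; the compiler described here works with a 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

#define STRIDE ((int64_t)1024 * 64 * 64)   /* 1024*64^2 = 2^22 */

/* Original form: ind is the induction variable, with a stride of 1. */
int64_t itot_original(int64_t n)
{
    int64_t itot = 0;
    for (int64_t ind = 1; ind <= n; ind++) {
        int64_t ipack = ind * STRIDE;
        if (ipack <= (n / 2) * STRIDE)
            itot += ipack;
    }
    return itot;
}

/* Replaced form: ipack is the induction variable, and the loop test
 * compares it against a limit that is computed once, before the loop,
 * as (first value) + (trip - 1) * stride. */
int64_t itot_replaced(int64_t n)
{
    int64_t itot = 0;
    int64_t half = (n / 2) * STRIDE;
    int64_t limit = STRIDE + (n - 1) * STRIDE;   /* last value of ipack */
    for (int64_t ipack = STRIDE; ipack <= limit; ipack += STRIDE) {
        if (ipack <= half)
            itot += ipack;
    }
    return itot;
}
```

Both functions perform the same work, but the second removes the
multiplication from each iteration. The precomputed limit is the value
that can overflow when it must fit in 32 bits.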

Linear test replacement, a standard optimization at levels +O2 and
above, normally does not cause problems. However, when the loop stride
is very large, a large trip count can cause the loop limit value
(start + ((trip - 1) * stride)) to overflow.

In the code above, the induction variable is a 4-byte integer, which
occupies 32 bits in memory. That means if start + ((trip - 1) * stride),
which is 1 + ((n - 1) * 2^22) here, is greater than 2^31 - 1, the value
overflows into the sign bit and is treated as a negative number. If the
stride value is negative, the absolute value of
start + ((trip - 1) * stride) must not exceed 2^31. When a loop has a
positive stride and the trip count overflows, the loop stops executing
when the overflow occurs because the limit becomes negative and the
termination test fails.

Because the largest allowable value for start + ((trip - 1) * stride) is
2^31 - 1, the start value is 1, and the stride is 2^22, the maximum trip
count for the loop can be found.

NOTE

The stride, trip, and start values for a loop must satisfy the following
inequality:

    start + ((trip - 1) * stride) < 2^31

The start value is 1, so trip is solved as follows:

    start + ((trip - 1) * stride)  <  2^31
        1 + ((trip - 1) * 2^22)    <  2^31
             (trip - 1) * 2^22     <  2^31 - 1
              trip - 1             <  2^9 - 2^(-22)
              trip                 <  2^9 - 2^(-22) + 1
              trip                 <= 512

The maximum value for n in the given loop, then, is 512.


If you find that certain loops give wrong answers at optimization levels +O2
or higher, the problem may be test replacement. If you still want to optimize
these loops at +O2 or above, restructure them to force the compiler to
choose a different induction variable.
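
The bound derived above can be checked for any positive stride with a
small helper. The sketch below is only an illustration; max_trip_count
is a hypothetical name, not a compiler or library routine, and it
assumes a positive stride and a signed 32-bit induction variable.

```c
#include <assert.h>
#include <stdint.h>

/* Return the largest trip count for which
 * start + (trip - 1) * stride still fits in a signed 32-bit integer.
 * Assumes stride > 0. Returns 0 if start itself is already too large. */
int64_t max_trip_count(int64_t start, int64_t stride)
{
    assert(stride > 0);
    if (start > INT32_MAX)
        return 0;
    return (INT32_MAX - start) / stride + 1;
}
```

For the loop discussed above, max_trip_count(1, 1 << 22) evaluates to
512, which agrees with the derivation.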

Large trip counts at +O2 and above

When a loop is optimized at level +O2 or above, its trip count must
occupy no more than a signed 32-bit storage location. The largest positive
value that can fit in this space is 2^31 - 1 (2,147,483,647). Loops with trip
counts that cannot be determined at compile time but that exceed 2^31 - 1
at runtime yield wrong answers.

This limitation only applies at optimization levels +O2 and above.

A loop with a trip count that overflows 32 bits can be optimized by manually
strip mining the loop.
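
Manual strip mining can be sketched in C as follows. The idea is to
split one long loop into an outer loop over strips and an inner loop
whose trip count always fits in 32 bits. The function name
strip_mined_sum, the strip parameter, and the loop body (summing i
modulo 7) are hypothetical and chosen only to keep the example small.

```c
#include <assert.h>
#include <stdint.h>

/* Sum (i % 7) for i in [0, total), where total may exceed 2^31 - 1.
 * The outer loop walks over strips; the inner loop never runs more
 * than 'strip' iterations, so its trip count fits in 32 bits. */
uint64_t strip_mined_sum(int64_t total, int32_t strip)
{
    uint64_t sum = 0;

    assert(strip > 0);
    for (int64_t base = 0; base < total; base += strip) {
        int64_t left = total - base;
        int32_t len = (left < strip) ? (int32_t)left : strip;

        for (int32_t i = 0; i < len; i++)
            sum += (uint64_t)((base + i) % 7);
    }
    return sum;
}
```

The strip length is a tuning choice. Any value up to 2^31 - 1 keeps the
inner trip count within the limit described above.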
Porting CPSlib functions to pthreads

Introduction

The Compiler Parallel Support Library (CPSlib) is a library of thread
management and synchronization routines that was initially developed
to control parallelism on HP's legacy multinode systems. Most programs
fully exploited their parallelism using higher-level devices such as
automatic parallelization, compiler directives, and message-passing.
CPSlib, however, provided a lower-level interface for the few cases that
required it.

With the introduction of the V2250 series server, HP recommends the
use of POSIX threads (pthreads) for purposes of thread management and
parallelism. Pthreads provide portability for programmers who want to
use their applications on multiple platforms.

This appendix describes how CPSlib functions map to pthread functions,
and how to write a pthread program to perform the same tasks as CPSlib
functions. Topics included in this appendix include:

• Accessing pthreads

• Symmetric parallelism

• Asymmetric parallelism

• Synchronization using high-level functions

• Synchronization using low-level functions

Accessing pthreads

When you use pthreads routines, your program must include the
<pthread.h> header file, and the pthreads library must be explicitly
linked to your program.

For example, assume the program prog.c contains calls to pthreads
routines. To compile the program so that it links in the pthreads library,
issue the following command:

% cc -D_POSIX_C_SOURCE=199506L prog.c -lpthread

The -D_POSIX_C_SOURCE=199506L string indicates the appropriate
POSIX revision level. In the example above, the level is indicated as
199506L.
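
As an illustration of what a file such as prog.c might contain, the
sketch below creates one thread and waits for it. The names worker and
run_worker are hypothetical; only pthread_create and pthread_join are
pthreads routines.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Thread body: store a value through the pointer passed as argument. */
static void *worker(void *arg)
{
    int *value = arg;

    *value = 42;
    return NULL;
}

/* Create one thread, wait for it, and return the value it stored. */
int run_worker(void)
{
    pthread_t tid;
    int value = 0;
    int rv;

    rv = pthread_create(&tid, NULL, worker, &value);
    assert(rv == 0);
    rv = pthread_join(tid, NULL);
    assert(rv == 0);
    return value;
}
```

A main function that calls run_worker() would be compiled and linked
with the command shown above.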
Mapping CPSlib functions to pthreads

Table 63 shows the mapping of the CPSlib functions to pthread
functions. Where applicable, a pthread function is listed as
corresponding to the appropriate CPSlib function. For instances where
there is no corresponding pthread function, pthread examples that
mimic CPSlib functionality are provided.

The CPSlib functions are grouped by type: barriers, informational,
low-level locks, low-level counter semaphores, symmetric and asymmetric
parallel functions, and mutexes.

Table 63 CPSlib library functions to pthreads mapping

CPSlib function               Maps to pthread function    Notes
----------------------------  --------------------------  --------------------------------------------

Symmetric parallel functions

cps_nsthreads                 N/A                         See "Symmetric parallelism" on page 318 for
                                                          more information.
cps_ppcall                    N/A                         See "Symmetric parallelism" on page 318 for
                                                          more information. Nesting is not supported
                                                          in this example.
cps_ppcalln                   N/A                         See "Symmetric parallelism" on page 318 for
                                                          more information.
cps_ppcallv                   N/A                         No example provided.
cps_stid                      N/A                         See "Symmetric parallelism" on page 318 for
                                                          more information.
cps_wait_attr                 N/A                         See "Symmetric parallelism" on page 318 for
                                                          more information.

Asymmetric parallel functions

cps_thread_create             pthread_create              See "Asymmetric parallelism" on page 329
                                                          for more information.
cps_thread_createn            pthread_create              Only supports passing of one argument.
                                                          See "Asymmetric parallelism" on page 329
                                                          for more information.
cps_thread_exit               pthread_exit                See "Asymmetric parallelism" on page 329
                                                          for more information.
cps_thread_register_lock                                  This function was formerly used in
                                                          conjunction with m_lock. It is now obsolete,
                                                          and is replaced with one call to
                                                          pthread_join.
                                                          See "Asymmetric parallelism" on page 329
                                                          for more information.
cps_thread_wait               N/A                         No example available.

Informational

cps_complex_cpus              pthread_num_processors_np   The HP pthread_num_processors_np function
                                                          returns the number of processors on the
                                                          machine.
cps_complex_nodes             N/A                         This functionality can be added using the
                                                          appropriate calls in your ppcall code.
cps_complex_nthreads          N/A                         This functionality can be added using the
                                                          appropriate calls in your ppcall code.
cps_is_parallel               N/A                         See the ppcall.c example on page 318 for
                                                          more information.
cps_plevel                                                Because pthreads have no concept of levels,
                                                          this function is obsolete.
cps_set_threads               N/A                         See the ppcall.c example on page 318 for
                                                          more information.
cps_topology                                              Use pthread_num_processors_np() to set up
                                                          your configuration as a single-node machine.

Synchronization using high-level barriers

cps_barrier                   N/A                         See the my_barrier.c example on page 332
                                                          for more information.
cps_barrier_alloc             N/A                         See the my_barrier.c example on page 332
                                                          for more information.
cps_barrier_free              N/A                         See the my_barrier.c example on page 332
                                                          for more information.

Synchronization using high-level mutexes

cps_limited_spin_mutex_alloc  pthread_mutex_init          The CPS mutex allocate functions allocated
                                                          memory and initialized the mutex. When you
                                                          use pthread mutexes, you must use
                                                          pthread_mutex_init to allocate the memory
                                                          and initialize it.
                                                          See pth_mutex.c on page 332 for a
                                                          description of using pthreads.
cps_mutex_alloc               pthread_mutex_init          The CPS mutex allocate functions allocated
                                                          memory and initialized the mutex. When you
                                                          use pthread mutexes, you must use
                                                          pthread_mutex_init to allocate the memory
                                                          and initialize it.
                                                          See pth_mutex.c on page 332 for a
                                                          description of using pthreads.
cps_mutex_free                pthread_mutex_destroy       cps_mutex_free formerly uninitialized the
                                                          mutex and called free to release memory.
                                                          When using pthread mutexes, you must first
                                                          call pthread_mutex_destroy.
                                                          See pth_mutex.c on page 332 for a
                                                          description of using pthreads.
cps_mutex_lock                pthread_mutex_lock          See pth_mutex.c on page 332 for a
                                                          description of using pthreads.
cps_mutex_trylock             pthread_mutex_trylock       See pth_mutex.c on page 332 for a
                                                          description of using pthreads.
cps_mutex_unlock              pthread_mutex_unlock        See pth_mutex.c on page 332 for a
                                                          description of using pthreads.

Synchronization using low-level locks

[mc]_cond_lock                pthread_mutex_trylock
[mc]_free32                   pthread_mutex_destroy       cps_mutex_free formerly uninitialized the
                                                          mutex and called free to release memory.
                                                          When using pthread mutexes, you must call
                                                          pthread_mutex_destroy.
[mc]_init32                   pthread_mutex_init
[mc]_lock                     pthread_mutex_lock
[mc]_unlock                   pthread_mutex_unlock

Synchronization using low-level counter semaphores

[mc]_fetch32                  N/A                         See the fetch_and_inc.c example on page 337
                                                          for a description of using pthreads.
[mc]_fetch_and_add32          N/A                         See the fetch_and_inc.c example on page 337
                                                          for a description of using pthreads.
[mc]_fetch_and_clear32        N/A                         See the fetch_and_inc.c example on page 337
                                                          for a description of using pthreads.
[mc]_fetch_and_dec32          N/A                         See the fetch_and_inc.c example on page 337
                                                          for a description of using pthreads.
[mc]_fetch_and_inc32          N/A                         See the fetch_and_inc.c example on page 337
                                                          for a description of using pthreads.
[mc]_fetch_and_set32          N/A                         See the fetch_and_inc.c example on page 337
                                                          for a description of using pthreads.
[mc]_init32                   N/A                         See the fetch_and_inc.c example on page 337
                                                          for a description of using pthreads.
Environment variables

Unlike CPSlib, pthreads does not use environment variables to establish
thread attributes. Pthreads implements function calls to achieve the
same results. However, when using the HP compiler set, the
environment variables below must be set to define attributes.

The table below describes the environment variables and how pthreads
handles the same or similar tasks.

The environment variables below must be set for use with the HP
compilers if you are not explicitly using pthreads.

Table 64 CPSlib environment variables

Environment variable  Description                   How handled by pthreads
--------------------  ----------------------------  ---------------------------------

MP_NUMBER_OF_THREADS  Sets the number of threads    By default, under HP-UX you can
                      that the compiler allocates   create more threads than you
                      at startup time.              have processors for.

MP_IDLE_THREADS_WAIT  Indicates how idle compiler   The values can be:
                      threads should wait.          -1 - spin wait;
                                                    0 - suspend wait;
                                                    N - spin suspend where N > 0.

CPS_STACK_SIZE        Tells the compiler what size  Pthreads allow you to set the
                      stack to allocate for all     stack size using attributes. The
                      its child threads. The        attribute call is
                      default stack size is         pthread_attr_setstacksize.
                      80 Mbyte.                     The value of CPS_STACK_SIZE is
                                                    specified in Kbytes.
Using pthreads

Some CPSlib functions map directly to existing pthread functions, as
shown in Table 63 on page 311. However, certain CPSlib functions, such
as cps_plevel, are obsolete in the scope of pthreads. While about half of
the CPSlib functions do not map to pthreads, their tasks can be
simulated by the programmer.

The examples presented in the following sections demonstrate various
constructs that can be programmed to mimic unmappable CPSlib
functions in pthreads. The examples shown here are provided as a first
step in replacing previous functionality provided by CPSlib with POSIX
thread standard calls.

This is not a tutorial in pthreads, nor do these examples describe
complex pthreads operations, such as nesting. For a definitive
description of how to use pthreads functions, see the book Threadtime by
Scott Norton and Mark D. Dipasquale.

Symmetric parallelism

Symmetric parallel threads are spawned in CPSlib using cps_ppcall()
or cps_ppcalln(). There is no logical mapping of these CPSlib
functions to pthread functions. However, you can create a program,
similar to the one shown in the ppcall.c example below, to achieve the
same results.

This example also includes the following CPSlib thread information
functions:

• my_nsthreads (a map created for cps_nsthreads) returns the
  number of threads in the current spawn context.

• my_stid (a map created for cps_stid) returns the spawn thread ID
  of the calling thread.

The ppcall.c example performs other tasks associated with
symmetrical thread processing, including the following:

• Allocates a cell barrier data structure based upon the number of
  threads in the current process by calling my_barrier_alloc

• Provides a barrier for threads to "join" or synchronize after parallel
  work is completed by calling my_join_barrier

• Creates data structures for threads created using pthread_create

• Uses the CPS_STACK_SIZE environment variable to determine the
  stack size

• Determines the number of threads to create by calling
  pthread_num_processors_np()

• Returns the number of threads by calling my_nsthreads()

• Returns the is_parallel flag by calling my_is_parallel()

ppcall.c 

/*
 * ppcall.c
 * function
 *     Symmetric parallel interface to using pthreads
 *     called my_thread package.
 *
 */

#ifndef _HPUX_SOURCE
#define _HPUX_SOURCE
#endif

#include <spp_prog_model.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include "my_ppcall.h"

#define K 1024
#define MB (K*K)

struct thread_data {
    int stid;
    int nsthreads;
    int release_flag;
};

typedef struct thread_data thread_t;
typedef struct thread_data *thread_p;

#define WAIT_UNKNOWN 0
#define WAIT_SPIN 1
#define WAIT_SUSPEND 2

#define MAX_THREADS 64

#define W_CACHE_SIZE 8
#define B_CACHE_SIZE 32

typedef struct {
    int volatile c_cell;
    int c_pad[W_CACHE_SIZE-1];
} cell_t;

#define ICELL_SZ (sizeof(int)*3+sizeof(char *))

struct cell_barrier {
    int          br_c_magic;
    int volatile br_c_release;
    char *       br_c_free_ptr;
    int          br_c_cell_cnt;
    char         br_c_pad[B_CACHE_SIZE-ICELL_SZ];
    cell_t       br_c_cells[1];
};

#define BR_CELL_T_SIZE(x) (sizeof(struct cell_barrier) + \
                           (sizeof(cell_t)*(x)))

/*
 * ALIGN - to align objects on specific alignments (usually on
 *         cache line boundaries).
 *
 * arguments
 *     obj       - pointer object to align
 *     alignment - alignment to align obj on
 *
 * Notes:
 *     We cast obj to a long, so that this code will work in
 *     either narrow or wide modes of the compilers.
 */
#define ALIGN(obj, alignment)\
    ((((long) obj) + alignment - 1) & ~(alignment - 1))

typedef struct cell_barrier * cell_barrier_t;

/*
 * File Variable Dictionary:
 *
 *     my_thread_mutex         - mutex to control access to the following:
 *                               my_func, idle_release_flag, my_arg,
 *                               my_call_thread_max, my_threads_are_init,
 *                               my_threads_are_parallel.
 *     idle_release_flag       - flag to release spinning
 *                               idle threads
 *     my_func                 - user specified function to call
 *     my_arg                  - argument to pass to my_func
 *     my_call_thread_max      - maximum number of threads
 *                               needed on this ppcall
 *     my_threads_are_init     - my thread package init flag
 *     my_threads_are_parallel - we are executing parallel
 *                               code flag
 *     my_thread_ids           - list of child thread ids
 *     my_barrier              - barrier used by the join
 *     my_thread_ptr           - the current thread's thread
 *                               pointer in thread-private
 *                               memory.
 */

static pthread_mutex_t my_thread_mutex = PTHREAD_MUTEX_INITIALIZER;
static int volatile    idle_release_flag = 0;
static void            (*my_func)(void *);
static void            *my_arg;
static int             my_call_thread_max;
static int             my_stacksize = 8*MB;
static int             thread_count = 1;
static int             my_threads_are_init = 0;
static int volatile    my_threads_are_parallel = 0;
static pthread_t       my_thread_ids[MAX_THREADS];
static cell_barrier_t  my_barrier;

static thread_p thread_private my_thread_ptr;

/*
 * my_barrier_alloc
 *     Allocate cell barrier data structure based upon the
 *     number of threads that are in the current process.
 *
 * arguments
 *     brc - pointer pointer to the user cell barrier
 *     n   - number of threads that will use this barrier
 *
 * return
 *     0   - success
 *     -1  - failed to allocate cell barrier
 */
static int
my_barrier_alloc(cell_barrier_t *brc, int n)
{
    cell_barrier_t b;
    char *p;
    int i;

    /*
     * Allocate cell barrier for 'n' threads
     */
    if ( (p = (char *) malloc(BR_CELL_T_SIZE(n))) == 0 )
        return -1;

    /*
     * Align the barrier on a cache line for maximum
     * performance.
     */
    b = (cell_barrier_t) ALIGN(p, B_CACHE_SIZE);
    b->br_c_magic = 0x4200beef;
    b->br_c_cell_cnt = n;    /* keep track of the # of threads */
    b->br_c_release = 0;     /* initialize release flag */
    b->br_c_free_ptr = p;    /* keep track of original malloc ptr */

    for( i = 0; i < n; i++ )
        b->br_c_cells[i].c_cell = 0;    /* zero the cell flags */

    *brc = b;
    return 0;
}

/*
 * my_join_barrier
 *     Provide a barrier for all threads to sync up at, after
 *     they have finished performing parallel work.
 *
 * arguments
 *     b  - pointer to cell barrier
 *     id - id of the thread (need to be in the
 *          range of 0 - (N-1), where N is the
 *          number of threads).
 *
 * return
 *     none
 */
static void
my_join_barrier(cell_barrier_t b, int id)
{
    int i, key;

    /*
     * Get the release flag value, before we signal that we
     * are at the barrier.
     */
    key = b->br_c_release;

    if ( id == 0 ) {
        /*
         * make thread 0 (i.e. parent thread) wait for the child
         * threads to show up.
         */
        for( i = 1; i < thread_count; i++ ) {
            /*
             * wait on the Nth cell
             */
            while ( b->br_c_cells[i].c_cell == 0 )
                /* spin */;

            /*
             * We can reset the Nth cell now,
             * because it is not being used anymore
             * until the next barrier.
             */
            b->br_c_cells[i].c_cell = 0;
        }

        /*
         * signal all of the child threads to leave the barrier.
         */
        ++b->br_c_release;
    } else {
        /*
         * signal that the Nth thread has arrived at the barrier.
         */
        b->br_c_cells[id].c_cell = -1;

        while ( key == b->br_c_release )
            /* spin */;
    }
}

/*
 * idle_threads
 *     All of the process child threads will execute this
 *     code. It is the idle loop where the child threads wait
 *     for parallel work.
 *
 * arguments
 *     thr - thread pointer
 *
 * algorithm:
 *     Initialize some thread specific data structures.
 *     Loop forever on the following:
 *         Wait until we have work.
 *         Get global values on what work needs to be done.
 *         Call user specified function with argument.
 *         Call barrier code to sync up all threads.
 */
static void
idle_threads(thread_p thr)
{
    /*
     * initialize the thread's thread-private memory pointer.
     */
    my_thread_ptr = thr;

    for(;;) {
        /*
         * threads spin here waiting for work to be assigned
         * to them.
         */
        while ( thr->release_flag == idle_release_flag )
            /* spin until idle_release_flag changes */;

        thr->release_flag = idle_release_flag;
        thr->nsthreads = my_call_thread_max;

        /*
         * call user function with their specified argument.
         */
        if ( thr->stid < my_call_thread_max )
            (*my_func)(my_arg);

        /*
         * make all threads join before they return to the idle
         * loop.
         */
        my_join_barrier(my_barrier, thr->stid);
    }
}


/*
 * create_threads
 *     This routine creates all of the MY THREADS package data
 *     structures and child threads.
 *
 * arguments:
 *     none
 *
 * return:
 *     none
 *
 * algorithm:
 *     Allocate data structures for a thread
 *     Create the thread via the pthread_create call.
 *     If the create call is successful, repeat until the
 *     number of threads equal the number of processors.
 */
static void
create_threads()
{
    pthread_attr_t attr;
    char *env_val;
    int i, rv, cpus, processors;
    thread_p thr;

    /*
     * allocate and initialize the thread structure for the
     * parent thread.
     */
    if ( (thr = (thread_p) malloc(sizeof(thread_t))) == NULL ) {
        fprintf(stderr, "my_threads: Fatal error: can not "
                        "allocate memory for main thread\n");
        abort();
    }
    my_thread_ptr = thr;
    thr->stid = 0;
    thr->release_flag = 0;

    /*
     * initialize attribute structure
     */
    (void) pthread_attr_init(&attr);

    /*
     * Check to see if the CPS_STACK_SIZE env variable is defined.
     * If it is, then use that as the stacksize.
     */
    if ( (env_val = getenv("CPS_STACK_SIZE")) != NULL ) {
        int val;

        val = atoi(env_val);
        if ( val > 128 )
            my_stacksize = val * K;
    }
    (void) pthread_attr_setstacksize(&attr, my_stacksize);

    /*
     * determine how many threads we will create.
     */
    processors = cpus = pthread_num_processors_np();
    if ( (env_val = getenv("MP_NUMBER_OF_THREADS")) != NULL ) {
        int val;

        val = atoi(env_val);
        if ( val >= 1 )
            cpus = val;
    }

    for( i = 1; i < cpus && i < MAX_THREADS; i++ ) {
        /*
         * allocate and initialize thread data structure.
         */
        if ( (thr = (thread_p) malloc(sizeof(thread_t))) == NULL )
            break;

        thr->stid = i;
        thr->release_flag = 0;

        rv = pthread_create(&my_thread_ids[i-1], &attr,
                 (void *(*)(void *)) idle_threads, (void *) thr);
        if ( rv != 0 ) {
            free(thr);
            break;
        }
        thread_count++;
    }

    my_threads_are_init = 1;
    my_barrier_alloc(&my_barrier, thread_count);

    /*
     * since we are done with this attribute, get rid of it.
     */
    (void) pthread_attr_destroy(&attr);
}


/*
 * my_ppcall
 *     Call user specified routine in parallel.
 *
 * arguments:
 *     max  - maximum number of threads that are needed.
 *     func - user specified function to call
 *     arg  - user specified argument to pass to func
 *
 * return:
 *     0       - success
 *     nonzero - error (EINVAL or EAGAIN)
 *
 * algorithm:
 *     If we are already parallel, then return with an error
 *     code. Allocate threads and internal data structures,
 *     if this is the first call.
 *     Determine how many threads we need.
 *     Set global variables.
 *     Signal the child threads that they have parallel work.
 *     At this point we signal all of the child threads and
 *     let them determine if they need to take part in the
 *     parallel call. Call the user specified function.
 *     Barrier call will sync up all threads.
 */
int
my_ppcall(int max, void (*func)(void *), void *arg)
{
    /*
     * check for error conditions
     */
    if ( max <= 0 || func == NULL )
        return EINVAL;

    if ( my_threads_are_parallel )
        return EAGAIN;

    (void) pthread_mutex_lock(&my_thread_mutex);
    if ( my_threads_are_parallel ) {
        (void) pthread_mutex_unlock(&my_thread_mutex);
        return EAGAIN;
    }

    /*
     * create the child threads, if they are not already created.
     */
    if ( !my_threads_are_init )
        create_threads();

    /*
     * set global variables to communicate to child threads.
     */
    if ( max > thread_count )
        my_call_thread_max = thread_count;
    else
        my_call_thread_max = max;

    my_func = func;
    my_arg = arg;

    my_thread_ptr->nsthreads = my_call_thread_max;
    ++my_threads_are_parallel;

    /*
     * signal all of the child threads to exit the spin loop
     */
    ++idle_release_flag;

    (void) pthread_mutex_unlock(&my_thread_mutex);

    /*
     * call user func with user specified argument
     */
    (*my_func)(my_arg);

    /*
     * call join to make sure all of the threads are done doing
     * their work.
     */
    my_join_barrier(my_barrier, my_thread_ptr->stid);

    (void) pthread_mutex_lock(&my_thread_mutex);

    /*
     * reset the parallel flag
     */
    my_threads_are_parallel = 0;

    (void) pthread_mutex_unlock(&my_thread_mutex);
    return 0;
}


/*
 * my_stid
 *     Return thread spawn thread id. This will be in the range
 *     of 0 to N-1, where N is the number of threads in the
 *     process.
 *
 * arguments:
 *     none
 *
 * return
 *     spawn thread id
 */
int
my_stid(void)
{
    return my_thread_ptr->stid;
}

/*
 * my_nsthreads
 *     Return the number of threads in the current spawn.
 *
 * arguments:
 *     none
 *
 * return
 *     number of threads in the current spawn
 */
int
my_nsthreads(void)
{
    return my_thread_ptr->nsthreads;
}

/*
 * my_is_parallel
 *     Return the is parallel flag
 *
 * arguments:
 *     none
 *
 * return
 *     1 - if we are parallel
 *     0 - otherwise
 */
int
my_is_parallel(void)
{
    int rv;

    /*
     * if my_threads_are_parallel is set, then we are parallel,
     * otherwise we are not.
     */
    (void) pthread_mutex_lock(&my_thread_mutex);
    rv = my_threads_are_parallel;
    (void) pthread_mutex_unlock(&my_thread_mutex);
    return rv;
}

/*
 * my_complex_cpus
 *     Return the number of threads in the current process.
 *
 * arguments:
 *     none
 *
 * return
 *     number of threads created by this process
 */
int
my_complex_cpus(void)
{
    int rv;

    /*
     * Return the number of threads that we currently have.
     */
    (void) pthread_mutex_lock(&my_thread_mutex);
    rv = thread_count;
    (void) pthread_mutex_unlock(&my_thread_mutex);
    return rv;
}


Asymmetric parallelism

Asymmetric parallelism is used when each thread executes a different,
independent instruction stream. Asymmetric threads are analogous to
the Unix fork system call construct in that the threads are disjoint.

Some of the asymmetric CPSlib functions map to pthread functions,
while others are no longer used, as noted below:

• cps_thread_create() spawned asymmetric threads and now maps
  to the pthread function pthread_create().

• cps_thread_createn(), which spawned asymmetric threads with
  multiple arguments, also maps to pthread_create(). However,
  pthread_create() only supports the passing of one argument.

• CPSlib terminated asymmetric threads using cps_thread_exit(),
  which now maps to the pthread function pthread_exit().

• cps_thread_register_lock has no corresponding pthread
  function. It was formerly used in conjunction with m_lock, both of
  which have been replaced with one call to pthread_join.

• cps_plevel(), the CPSlib function which determined the current
  level of parallelism, does not have a corresponding pthread function,
  because levels do not mean anything to pthreads.

The first example in this section, create.c, provides an example of
the above CPSlib functions being used to create asymmetric parallelism.

create.c

/*
 * create.c
 *     Show how to use all of the cps asymmetric functions.
 *
 */

#include <cps.h>
#include <stdio.h>

mem_sema_t wait_lock;

void
tfunc(void *arg)
{
    int i;

    /*
     * Register the wait_lock, so that the parent thread
     * can wait on us to exit.
     */
    (void) cps_thread_register_lock(&wait_lock);

    for ( i = 0; i < 100000; i++ )
        /* spin for a spell */;

    printf("tfunc: ktid = %d\n", cps_ktid());
    cps_thread_exit();
}

int
main()
{
    int node = 0;
    ktid_t ktid;

    /*
     * Initialize and lock the wait_lock.
     */
    m_init32(&wait_lock, &node);
    m_cond_lock(&wait_lock);

    ktid = cps_thread_create(&node, tfunc, NULL);

    /*
     * We wait for the wait_lock to be released. That is
     * how we know that the child thread
     * has terminated.
     */
    m_lock(&wait_lock);
}

pth_create.c

The pth_create.c example below shows how to use the pthread functions
that map to the asymmetric functions used in the CPSlib example.

/*
 * pth_create.c
 *     Show how to use all of the pthread functions that
 *     map to cps asymmetric functions.
 *
 */

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void
tfunc(void *arg)
{
    int i;

    for ( i = 0; i < 100000; i++ )
        /* spin for a spell */;

    printf("tfunc: ktid = %d\n", pthread_self());
    pthread_exit(0);
}

int
main()
{
    pthread_t ktid;
    int status;

    (void) pthread_create(&ktid, NULL, (void *(*)(void *)) tfunc, NULL);

    /*
     * Wait for the child to terminate.
     */
    (void) pthread_join(ktid, NULL);
    exit(0);
}
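
Because pthread_create() passes only one argument to the thread
function, code ported from cps_thread_createn() needs another way to
hand over several values. A common approach, sketched below, is to
bundle the values in a structure and pass its address. The structure
work_args and the functions add_worker and add_in_thread are
hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* All values the thread needs, plus a slot for its result. */
struct work_args {
    int a;
    int b;
    int sum;
};

/* Thread body: unpack the structure and store the result in it. */
static void *add_worker(void *arg)
{
    struct work_args *w = arg;

    w->sum = w->a + w->b;
    return NULL;
}

/* Run add_worker in a new thread and return the sum it computed. */
int add_in_thread(int a, int b)
{
    struct work_args w = { a, b, 0 };
    pthread_t tid;
    int rv;

    rv = pthread_create(&tid, NULL, add_worker, &w);
    assert(rv == 0);
    rv = pthread_join(tid, NULL);   /* w must stay valid until here */
    assert(rv == 0);
    return w.sum;
}
```

The structure lives on the caller's stack in this sketch, which is safe
only because the caller joins the thread before returning.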
Synchronization using high-level functions

This section demonstrates how to use barriers and mutexes to
synchronize symmetrically parallel code.

Barriers

Implicit barriers are operations in a program where threads are
restricted from completion based upon the status of the other threads.
For example, in the ppcall.c example (on page 319), a join operation
occurs after all spawned threads terminate and before the function
returns. This type of implicit barrier is often the only type of barrier
required.

The my_barrier.c example shown below provides a pthreads
implementation of CPSlib barrier routines. This includes the following
example functions:

• my_barrier_init is similar to the cps_barrier_alloc function
  in that it allocates the barrier (br) and sets its associated memory
  counter to zero.

• my_barrier, like the CPSlib function cps_barrier, operates as a
  barrier wait routine. When the value of the shared counter is equal to
  the argument n (number of threads), the counter is set to zero.

• my_barrier_destroy, like cps_barrier_free, releases the
  barrier.

my_b a r rie r.c 

/* 

* my_barrier.c 

*Code to support a fetch and increment type barrier. 

*/ 

#ifndef _HPUX_SOURCE 
#define _HPUX_SOURCE 
#endif 

#include <pthread.h> 

#include <errno.h> 

/* 

* barrier 

* magic 

* counter 

* release 

•k 

* lock 


barrier valid flag 

shared counter between threads 

shared release flag, used to signal waiting 

threads to stop waiting. 

binary semaphore use to control read/write 
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* access to counter and write access to 

* release. 

*/ 

struct barrier { 
int 

int volatile 
int volatile 
pthread_mutex_t 

}; 

#define VALID_BARRIER 0x4242beef 

#define INVALID_BARRIER Oxdeadbeef 

typedef struct barrier barrier_t; 
typedef struct barrier *barrier_p; 

/* 

* my_barrier_init 

* Initialized a barrier for use. 

■k 

* arguments 

* br- pointer to the barrier to be initialize. 

■k 

* return 

* 0- success 

* >0- error code of failure. 

*/ 

int 

my_barrier_init(barrier_p *br) 

{ 

barrier_p b, n; 
int rv; 

b = (barrier_p) *br; 

if ( b != NULL ) 
return EINVAL; 

if ( (n = (barrier_p) malloc(sizeof(*n))) == NULL ) 
return ENOMEM; 

if ( (rv = pthread_mutex_init(&n->lock, NULL)) != 0 ) 
return rv; 

n->magic = VALID_BARRIER; 
n->counter = 0; 
n->release = 0; 

*br = n; 

return 0; 

} 

/* 


magic; 

counter; 

releasee- 

lock; 
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* my_barrier 

* barrier wait routine. 

~k 


* arguments 

* br - barrier to wait on 

* n - number of threads to wait on 


* return 

* 0 - success 

* EINVAL - invalid arguments 


*/ 

int
my_barrier(barrier_p br, int n)
{
    int rv;
    int key;

    if ( br == NULL || br->magic != VALID_BARRIER )
        return EINVAL;

    pthread_mutex_lock(&br->lock);

    key = br->release;      /* get release flag */
    rv = br->counter++;     /* fetch and inc shared counter */

    /*
     * See if we are the last thread into the barrier
     */
    if ( rv == n-1 ) {
        /*
         * We are the last thread, so clear the counter
         * and signal the other threads by changing the
         * release flag.
         */
        br->counter = 0;
        ++br->release;
        pthread_mutex_unlock(&br->lock);
    } else {
        pthread_mutex_unlock(&br->lock);

        /*
         * We are not the last thread, so wait
         * until the release flag changes.
         */
        while( key == br->release )
            /* spin */;
    }

    return 0;
}


/*
 * my_barrier_destroy
 *     destroy a barrier
 *
 * arguments
 *     b - barrier to destroy
 *
 * return
 *     0  - success
 *     >0 - error code for why the barrier cannot be destroyed
 */
int
my_barrier_destroy(barrier_p *b)
{
    barrier_p br = (barrier_p) *b;
    int rv;

    if ( br == NULL || br->magic != VALID_BARRIER )
        return EINVAL;

    if ( (rv = pthread_mutex_destroy(&br->lock)) != 0 )
        return rv;

    br->magic = INVALID_BARRIER;
    br->counter = 0;
    br->release = 0;

    free(br);       /* release the storage allocated by my_barrier_init */
    *b = NULL;

    return 0;
}
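
The barrier in my_barrier.c spins on the release flag, so a waiting thread keeps its processor busy. When threads may outnumber processors, a blocking wait can be preferable. The following is a minimal sketch of that variation, not part of CPSlib or of the listing above: it keeps the counter and release fields but sleeps on a pthread condition variable instead of spinning. The names cv_barrier_t, cv_barrier_init, cv_barrier_wait, and run_phase_check are hypothetical and chosen only for illustration.

```c
#include <pthread.h>

/* Hypothetical blocking barrier: same counter/release idea as
 * my_barrier.c, but waiters sleep on a condition variable. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    int             counter;    /* threads that have arrived */
    int             release;    /* changes each time the barrier opens */
} cv_barrier_t;

int cv_barrier_init(cv_barrier_t *b)
{
    int rv;

    b->counter = 0;
    b->release = 0;
    if ((rv = pthread_mutex_init(&b->lock, NULL)) != 0)
        return rv;
    return pthread_cond_init(&b->cv, NULL);
}

/* Block until n threads have called cv_barrier_wait. */
int cv_barrier_wait(cv_barrier_t *b, int n)
{
    int key;

    pthread_mutex_lock(&b->lock);
    key = b->release;
    if (++b->counter == n) {
        /* Last thread in: reset the counter and wake the others. */
        b->counter = 0;
        ++b->release;
        pthread_cond_broadcast(&b->cv);
    } else {
        /* Re-test the flag: condition waits may wake spuriously. */
        while (key == b->release)
            pthread_cond_wait(&b->cv, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
    return 0;
}

#define NTHREADS 4
static cv_barrier_t bar;
static int slot[NTHREADS];
static int seen[NTHREADS];

static void *worker(void *arg)
{
    int id = (int)(long)arg;

    slot[id] = id + 1;                      /* phase 1: publish a value */
    cv_barrier_wait(&bar, NTHREADS);
    seen[id] = slot[(id + 1) % NTHREADS];   /* phase 2: read a neighbor */
    return NULL;
}

/* Returns the number of threads that read a stale neighbor value. */
int run_phase_check(void)
{
    pthread_t tid[NTHREADS];
    int i, bad = 0;

    cv_barrier_init(&bar);
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(long)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    for (i = 0; i < NTHREADS; i++)
        if (seen[i] != (i + 1) % NTHREADS + 1)
            bad++;
    return bad;
}
```

Because every thread passes the barrier only after all phase 1 stores are complete, run_phase_check is expected to return 0. The spinning version remains the closer match to the CPSlib behavior when each thread has its own processor.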

Mutexes 

Mutexes (binary semaphores) allow threads to control access to shared 
data and resources. The CPSlib mutex functions map directly to existing 
pthread mutex functions as shown in Table 63 on page 311. The example 
below, pth_mutex.c, shows a basic pthread mutex program using the 
pthread_mutex_init, pthread_mutex_lock, pthread_mutex_trylock, and 
pthread_mutex_unlock functions. 

There are some differences between the behavior of CPSlib mutex 
functions and low-level locks (cache semaphores and memory 
semaphores) and the behavior of pthread mutex functions, as described 
below: 

• CPS cache and memory semaphores do not perform deadlock 
detection. 

• The default pthread mutex does not perform deadlock detection under 
HP-UX. This may be different from other operating systems. 
pthread_mutex_lock will only detect deadlock if the mutex is of the 
type PTHREAD_MUTEX_ERRORCHECK. 

• All of the CPSlib unlock routines allow other threads to release a lock 
that they do not own. This is not true with pthread_mutex_unlock. 
If you do this with pthread_mutex_unlock, the result is either an 
error or undefined behavior. 

pth_mutex.c 

/*
 * pth_mutex.c
 *     Demonstrate pthread mutex calls.
 *
 * Notes when switching from cps mutex, cache semaphore or
 * memory semaphores to pthread mutex:
 *
 * 1) Cps cache and memory semaphores did no checking.
 * 2) All of the cps semaphore unlock routines allow
 *    other threads to release a lock that they do not
 *    own. This is not the case with
 *    pthread_mutex_unlock. It is either an error or
 *    undefined behavior.
 * 3) The default pthread mutex does not do deadlock
 *    detection under HP-UX (this can be different on
 *    other operating systems).
 */

#ifndef _HPUX_SOURCE
#define _HPUX_SOURCE
#endif

#include <pthread.h>
#include <stdio.h>      /* perror, printf */
#include <stdlib.h>     /* abort, exit */
#include <errno.h>

pthread_mutex_t counter_lock;
int volatile counter = 0;

/* Thread start routine: increment the shared counter under the lock. */
void *
tfunc(void *arg)
{
    (void) arg;     /* unused */

    (void) pthread_mutex_lock(&counter_lock);
    ++counter;
    (void) pthread_mutex_unlock(&counter_lock);

    return NULL;
}

int
main()
{
    pthread_t tid;

    if ( (errno = pthread_mutex_init(&counter_lock, NULL)) != 0 ) {
        perror("pth_mutex: pthread_mutex_init failed");
        abort();
    }

    if ( (errno = pthread_create(&tid, NULL, tfunc, NULL)) != 0 ) {
        perror("pth_mutex: pthread_create failed");
        abort();
    }

    (void) tfunc(NULL);

    (void) pthread_join(tid, NULL);

    if ( (errno = pthread_mutex_destroy(&counter_lock)) != 0 ) {
        perror("pth_mutex: pthread_mutex_destroy failed");
        abort();
    }

    if ( counter != 2 ) {
        errno = EINVAL;
        perror("pth_mutex: counter value is wrong");
        abort();
    }

    printf("PASSED\n");
    exit(0);
}
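
The pth_mutex.c listing exercises the init, lock, unlock, and destroy calls. As a complement, the following minimal sketch shows pthread_mutex_trylock, which returns EBUSY instead of blocking when the mutex is already held. The function names try_bump and probe_while_held are hypothetical and are not part of the original example.

```c
#include <pthread.h>
#include <errno.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int volatile counter = 0;

/* Increment the counter only if the lock is free.
 * Returns 0 on success, or EBUSY if another thread holds the lock. */
int try_bump(void)
{
    int rv = pthread_mutex_trylock(&counter_lock);

    if (rv != 0)
        return rv;
    ++counter;
    return pthread_mutex_unlock(&counter_lock);
}

static void *prober(void *arg)
{
    *(int *)arg = try_bump();
    return NULL;
}

/* Hold the lock in the calling thread while a second thread
 * probes it; returns what the second thread observed. */
int probe_while_held(void)
{
    pthread_t tid;
    int seen = -1;

    pthread_mutex_lock(&counter_lock);
    pthread_create(&tid, NULL, prober, &seen);
    pthread_join(tid, NULL);
    pthread_mutex_unlock(&counter_lock);
    return seen;
}
```

A thread that receives EBUSY can do other useful work and retry later, which is the usual reason to prefer trylock over lock.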


Synchronization using low-level functions 

This section demonstrates how to use semaphores to synchronize 
symmetrically parallel code. This includes functions, such as low-level 
locks, for which there are pthread mappings, and low-level counter 
semaphores for which there are no pthread mappings. In this instance, 
an example is provided so that you can create a program to emulate 
CPSlib functions, using pthreads. 

Low-level locks 

The disposition of CPSlib's low-level locking functions is handled by the 
pthread mutex functions (as described in Table 63 on page 311). See 
“Mutexes” on page 335 for an example of how to use pthread mutexes. 

Low-level counter semaphores 

The CPSlib [mc]_init32 routines allocate and set the low-level CPSlib 
semaphores to be used as counters. There are no pthread mappings for 
these functions. However, a pthread example is provided below. 

This example, fetch_and_inc.c, documents the following tasks: 

• my_init allocates a counter semaphore and initializes the counter 
associated with it (p) to a value. 

• my_fetch returns the current value of the counter associated with 
the semaphore. 

• my_fetch_and_clear returns the current value of the counter 
associated with the semaphore and clears the counter. 

• my_fetch_and_inc increments the value of the counter associated 
with the semaphore and returns the old value. 

• my_fetch_and_dec decrements the value of the counter associated 
with the semaphore and returns the old value. 

• my_fetch_and_add adds a value (int val) to the counter associated 
with the semaphore and returns the old value of the integer. 

• my_fetch_and_set returns the current value of the counter 
associated with the semaphore, and sets the semaphore to the new 
value contained in int val. 

The [mc]_init32 routines allocate and set the low-level CPSlib 
semaphores to be used as either counters or locks. The example for 
counters provides a pthread implementation in place of the following 
CPSlib functions: 

• [mc]_fetch32 

• [mc]_fetch_and_clear32 

• [mc]_fetch_and_inc32 

• [mc]_fetch_and_dec32 

• [mc]_fetch_and_add32 

• [mc]_fetch_and_set32 

fetch_and_inc.c 

/*
 * fetch_and_inc
 *     How to support fetch_and_inc type semaphores using pthreads
 */

#ifndef _HPUX_SOURCE
#define _HPUX_SOURCE
#endif

#include <pthread.h>
#include <stdlib.h>     /* malloc, free */
#include <errno.h>

struct fetch_and_inc {
    int volatile    value;
    pthread_mutex_t lock;
};

typedef struct fetch_and_inc fetch_and_inc_t;
typedef struct fetch_and_inc *fetch_and_inc_p;

int
my_init(fetch_and_inc_p *counter, int val)
{
    fetch_and_inc_p p;
    int rv;

    if ( (p = (fetch_and_inc_p) malloc(sizeof(*p))) == NULL )
        return ENOMEM;

    if ( (rv = pthread_mutex_init(&p->lock, NULL)) != 0 ) {
        free(p);        /* do not leak the counter on failure */
        return rv;
    }

    p->value = val;
    *counter = p;

    return 0;
}

int
my_fetch(fetch_and_inc_p counter)
{
    int rv;

    pthread_mutex_lock(&counter->lock);
    rv = counter->value;
    pthread_mutex_unlock(&counter->lock);
    return rv;
}

int
my_fetch_and_clear(fetch_and_inc_p counter)
{
    int rv;

    pthread_mutex_lock(&counter->lock);
    rv = counter->value;
    counter->value = 0;
    pthread_mutex_unlock(&counter->lock);
    return rv;
}

int
my_fetch_and_inc(fetch_and_inc_p counter)
{
    int rv;

    pthread_mutex_lock(&counter->lock);
    rv = counter->value++;
    pthread_mutex_unlock(&counter->lock);
    return rv;
}

int
my_fetch_and_dec(fetch_and_inc_p counter)
{
    int rv;

    pthread_mutex_lock(&counter->lock);
    rv = counter->value--;
    pthread_mutex_unlock(&counter->lock);
    return rv;
}

int
my_fetch_and_add(fetch_and_inc_p counter, int val)
{
    int rv;

    pthread_mutex_lock(&counter->lock);
    rv = counter->value;
    counter->value += val;
    pthread_mutex_unlock(&counter->lock);
    return rv;
}

int
my_fetch_and_set(fetch_and_inc_p counter, int val)
{
    int rv;

    pthread_mutex_lock(&counter->lock);
    rv = counter->value;
    counter->value = val;
    pthread_mutex_unlock(&counter->lock);
    return rv;
}
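
A common use of a fetch-and-increment counter is self-scheduling, in which each thread claims the next unprocessed loop iteration from a shared index. The following minimal sketch illustrates the idea with a statically initialized mutex so that it is self-contained; the names fetch_and_inc, worker, and run_self_scheduled, and the constants NITER and NWORKERS, are hypothetical and do not come from CPSlib.

```c
#include <pthread.h>

#define NITER    1000
#define NWORKERS 4

static pthread_mutex_t next_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_index = 0;      /* shared counter */
static int claimed[NITER];      /* times each iteration was executed */

/* Same idea as my_fetch_and_inc: return the old value, then increment. */
static int fetch_and_inc(void)
{
    int rv;

    pthread_mutex_lock(&next_lock);
    rv = next_index++;
    pthread_mutex_unlock(&next_lock);
    return rv;
}

static void *worker(void *arg)
{
    int i;

    (void) arg;     /* unused */
    while ((i = fetch_and_inc()) < NITER)
        claimed[i]++;   /* index i is handed to exactly one thread */
    return NULL;
}

/* Returns the number of iterations not executed exactly once. */
int run_self_scheduled(void)
{
    pthread_t tid[NWORKERS];
    int i, bad = 0;

    for (i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    for (i = 0; i < NITER; i++)
        if (claimed[i] != 1)
            bad++;
    return bad;
}
```

Because the fetch and the increment happen under one lock, no two threads can receive the same index, so run_self_scheduled is expected to return 0.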
Glossary 

absolute address An address 
that does not undergo virtual-to- 
physical address translation when 
used to reference memory or the 
I/O register area. 

accumulator A variable used to 
accumulate value. Accumulators 
are typically assigned a function of 
themselves, which can create 
dependences when done in loops. 

actual argument In Fortran, a 
value that is passed by a call to a 
procedure (function or subroutine). 
The actual argument appears in 
the source of the calling procedure; 
the argument that appears in the 
source of the called procedure is a 
dummy argument. C and C++ 
conventions refer to actual 
arguments as actual parameters. 

actual parameter In C and 
C++, a value that is passed by a 
call to a procedure (function). The 
actual parameter appears in the 
source of the calling procedure; the 
parameter that appears in the 
source of the called procedure is a 
formal parameter. Fortran 
conventions refer to actual 
parameters as actual arguments. 

address A number used by the 
operating system to identify a 
storage location. 


address space Memory space, 
either physical or virtual, available 
to a process. 

alias An alternative name for 
some object, especially an 
alternative variable name that 
refers to a memory location. 
Aliases can cause data 
dependences, which prevent the 
compiler from parallelizing parts 
of a program. 

alignment A condition in which 
the address, in memory, of a given 
data item is integrally divisible by 
a particular integer value, often 
the size of the data item itself. 
Alignment simplifies the 
addressing of such data items. 

allocatable array In Fortran 
90, a named array whose rank is 
specified at compile time, but 
whose bounds are determined at 
run time. 

allocate An action performed by 
a program at runtime in which 
memory is reserved to hold data of 
a given type. In Fortran 90, this is 
done through the creation of 
allocatable arrays. In C, it is done 
through the dynamic creation of 
memory blocks using malloc. In 
C++, it is done through the 
dynamic creation of memory blocks 
using malloc or new. 

ALU Arithmetic logic unit. A 
basic element of the central 
processing unit (CPU) where 
arithmetic and logical operations 
are performed. 

Amdahl's law A statement that 
the ultimate performance of a 
computer system is limited by the 
slowest component. In the context 
of HP servers this is interpreted to 
mean that the serial component of 
the application code will restrict 
the maximum speed-up that is 
achievable. 
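
The limit described above can be made concrete with the usual form of Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the serial run time that can be parallelized and n is the number of processors. The following is a minimal sketch; the function name amdahl_speedup and its parameter names are hypothetical.

```c
/* Speedup predicted by Amdahl's law.
 *   parallel_fraction - share of the serial run time that can be
 *                       parallelized (0.0 to 1.0)
 *   nprocs            - number of processors */
double amdahl_speedup(double parallel_fraction, int nprocs)
{
    double serial_fraction = 1.0 - parallel_fraction;

    return 1.0 / (serial_fraction + parallel_fraction / nprocs);
}
```

For example, a program that is 90 percent parallelizable speeds up by a factor of about 5.3 on 10 processors, and the speedup cannot exceed 10 however many processors are added.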

American National Standards 
Institute (ANSI) A repository 
and coordinating agency for 
standards implemented in the U.S. 
Its activities include the 
production of Federal Information 
Processing (FIPS) standards for 
the Department of Defense (DoD). 

ANSI See American National 
Standards Institute. 

apparent recurrence A 
condition or construct that fails to 
provide the compiler with 
sufficient information to determine 
whether or not a recurrence exists. 
Also called a potential recurrence. 

argument In Fortran, either a 
variable declared in the argument 
list of a procedure (function or 
subroutine) that receives a value 
when the procedure is called 
(dummy argument) or the variable 
or constant that is passed by a call 
to a procedure (actual argument). 

C and C++ conventions refer to 
arguments as parameters. 


arithmetic logic unit (ALU) A 
basic element of the central 
processing unit (CPU) where 
arithmetic and logical operations 
are performed. 

array An ordered structure of 
operands of the same data type. 
The structure of an array is 
defined by its rank, shape, and 
data type. 

array section A Fortran 90 
construct that defines a subset of 
an array by providing starting and 
ending elements and strides for 
each dimension. For an array 
A(4,4), A(2:4:2,2:4:2) is an 
array section containing only the 
evenly indexed elements A(2,2), 
A(4,2), A(2,4), and A(4,4). 

array-valued argument In 
Fortran 90, an array section that is 
an actual argument to a 
subprogram. 

ASCII American Standard Code 
for Information Interchange. This 
encodes printable and nonprintable 
characters into a range of 
integers. 

assembler A program that 
converts assembly language 
programs into executable machine 
code. 

assembly language A 
programming language whose 
executable statements can each be 
translated directly into a 
corresponding machine instruction 
of a particular computer system. 

automatic array In Fortran, an 
array of explicit rank that is not a 
dummy argument and is declared 
in a subprogram. 

bandwidth A measure of the 
rate at which data can be moved 
through a device or circuit. 
Bandwidth is usually measured in 
millions of bytes per second 
(Mbytes/sec) or millions of bits per 
second (Mbits/sec). 

bank conflict An attempt to 
access a particular memory bank 
before a previous access to the 
bank is complete, or when the 
bank is not yet finished recycling 
(i.e., refreshing). 

barrier A structure used by the 
compiler in barrier 
synchronization. Also sometimes 
used to refer to the construct used 
to implement barrier 
synchronization. See also barrier 
synchronization. 

barrier synchronization A 
control mechanism used in parallel 
programming that ensures all 
threads have completed an 
operation before continuing 
execution past the barrier in 
sequential mode. On HP servers, 
barrier synchronization can be 
automated by certain CPSlib 
routines and compiler directives. 
See also barrier. 

basic block A linear sequence of 
machine instructions with a single 
entry and a single exit. 

bit A binary digit. 


blocking factor Integer 
representing the stride of the outer 
strip of a pair of loops created by 
blocking. 

branch A class of instructions 
which change the value of the 
program counter to a value other 
than that of the next sequential 
instruction. 

byte A group of contiguous bits 
starting on an addressable 
boundary. A byte is 8 bits in 
length. 

cache A small, high-speed buffer 
memory used in modern computer 
systems to hold temporarily those 
portions of the contents of the 
memory that are, or are believed to 
be, currently in use. Cache memory 
is physically separate from main 
memory and can be accessed with 
substantially less latency. HP 
servers employ separate data and 
instruction cache memories. 

cache, direct mapped A form 
of cache memory that addresses 
encached data by a function of the 
data's virtual address. On V2250 
servers, the processor cache 
address is identical to the least- 
significant 21 bits of the data's 
virtual address. This means cache 
thrashing can occur when the 
virtual addresses of two data items 
are an exact multiple of 2 Mbyte 
(21 bits) apart. 
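
The mapping described above can be sketched in a few lines. This is an illustration only, assuming the 2-Mbyte cache size given in this entry; the names CACHE_BYTES, cache_address, and same_cache_address are hypothetical.

```c
#include <stdint.h>

#define CACHE_BYTES (2UL * 1024 * 1024)     /* 2 Mbyte = 2^21 bytes */

/* Cache address as described above: the least-significant
 * 21 bits of the virtual address. */
uintptr_t cache_address(uintptr_t vaddr)
{
    return vaddr & (CACHE_BYTES - 1);
}

/* Nonzero if two virtual addresses map to the same cache address. */
int same_cache_address(uintptr_t a, uintptr_t b)
{
    return cache_address(a) == cache_address(b);
}
```

Two arrays whose starting addresses differ by an exact multiple of 2 Mbyte therefore compete for the same cache locations element by element; offsetting one of them by a small amount is a common way to avoid the conflict.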

cache hit A cache hit occurs if 
data to be loaded is residing in the 
cache. 

cache line A chunk of 
contiguous data that is copied into 
a cache in one operation. On V2250 
servers, processor cache lines are 
32 bytes. 

cache memory A small, high-speed 
buffer memory used in 
modern computer systems to hold 
temporarily those portions of the 
contents of the memory that are, or 
are believed to be, currently in use. 
Cache memory is physically 
separate from main memory and 
can be accessed with substantially 
less latency. V2250 servers employ 
separate data and instruction 
caches. 

cache miss A cache miss occurs 
if data to be loaded is not residing 
in the cache. 

cache purge The act of 
invalidating or removing entries in 
a cache memory. 

cache thrashing Cache 
thrashing occurs when two or more 
data items that are frequently 
needed by the program map to the 
same cache address. In this case, 
each time one of the items is 
encached it overwrites another 
needed item, causing constant 
cache misses and impairing data 
reuse. Cache thrashing also occurs 
when two or more threads are 
simultaneously writing to the 
same cache line. 

central processing unit 
(CPU) The central processing 
unit (CPU) is that portion of a 
computer that recognizes and 
executes the instruction set. 


clock cycle The duration of the 
square wave pulse sent throughout 
a computer system to synchronize 
operations. 

clone A compiler-generated copy 
of a loop or procedure. When the 
HP compilers generate code for a 
parallelizable loop, they generate 
two versions: a serial clone and a 
parallel clone. See also dynamic 
selection. 

code A computer program, 
either in source form or in the form 
of an executable image on a 
machine. 

coherency A term frequently 
applied to caches. If a data item is 
referenced by a particular 
processor on a multiprocessor 
system, the data is copied into that 
processor’s cache and is updated 
there if the processor modifies the 
data. If another processor 
references the data while a copy is 
still in the first processor's cache, a 
mechanism is needed to ensure 
that the second processor does not 
use an outdated copy of the data 
from memory. The state that is 
achieved when both processors' 
caches always have the latest 
value for the data is called cache 
coherency. On multiprocessor 
servers an item of data may reside 
concurrently in several processors' 
caches. 

column-major order Memory 
representation of an array such 
that the columns are stored 
contiguously. For example, given a 
two-dimensional array A(3,4), 
the array element A(3,1) 
immediately precedes element 
A(1,2) in memory. This is the 
default storage method for arrays 
in Fortran. 
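
The ordering can be expressed as an offset calculation. The following minimal sketch assumes 1-based Fortran subscripts; the function name colmajor_offset is hypothetical.

```c
/* Offset, in elements, of A(i,j) from A(1,1) for a Fortran array
 * with `rows` rows stored in column-major order (1-based i and j). */
int colmajor_offset(int i, int j, int rows)
{
    return (j - 1) * rows + (i - 1);
}
```

For the array A(3,4), A(3,1) is at offset 2 and A(1,2) is at offset 3, which is why the first immediately precedes the second in memory.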

compiler A computer program 
that translates computer code 
written in a high-level 
programming language, such as 
Fortran, into equivalent machine 
language. 

concurrent In parallel 
processing, threads that can 
execute at the same time are called 
concurrent threads. 

conditional induction 
variable A loop induction 
variable that is not necessarily 
incremented on every iteration. 

constant folding Replacement 
of an operation on constant 
operands with the result of the 
operation. 

constant propagation The 
automatic compile-time 
replacement of variable references 
with a constant value previously 
assigned to that variable. Constant 
propagation is performed within a 
single procedure by conventional 
compilers. 

conventional compiler A 
compiler that cannot perform 
interprocedural optimization. 

counter A variable that is used 
to count the number of times an 
operation occurs. 

CPA CPU Agent. The gate array 
on V2250 servers that provides a 
high-speed interface between pairs 
of PA-RISC processors and the 
crossbar. Also called the CPU 
Agent and the agent. 


CPU Central processing unit. 
The central processing unit (CPU) 
is that portion of a computer that 
recognizes and executes the 
instruction set. 

CPU Agent The gate array on 
V2250 servers that provides a 
high-speed interface between pairs 
of PA-RISC processors and the 
crossbar. 

CPU-private memory Data 
that is accessible by a single 
thread only (not shared among the 
threads constituting a process). A 
thread-private data object has a 
unique virtual address which maps 
to a unique physical address. 
Threads access the physical copies 
of thread-private data residing on 
their own hypernode when they 
access thread-private virtual 
addresses. 

CPU time The amount of time 
the CPU requires to execute a 
program. Because programs share 
access to a CPU, the wall-clock 
time of a program may not be the 
same as its CPU time. If a program 
can use multiple processors, the 
CPU time may be greater than the 
wall-clock time. (See wall-clock 
time.) 

critical section A portion of a 
parallel program that can be 
executed by only one thread at a 
time. 

crossbar A switching device that 
connects the CPUs, banks of 
memory, and I/O controller on a 
single hypernode of a V2250 
server. Because the crossbar is 
nonblocking, all ports can run at 
full bandwidth simultaneously, 
provided there is not contention for 
a particular port. 

CSR Control/Status Register. A 
CSR is a software-addressable 
hardware register used to hold 
control information or state. 

data cache (Dcache) A small 
cache memory with a fast access 
time. This cache holds prefetched 
and current data. On V2250 
servers, processors have 2-Mbyte 
off-chip caches. See also cache, 
direct mapped. 

data dependence A 
relationship between two 
statements in a program, such that 
one statement must precede the 
other to produce the intended 
result. (See also loop-carried 
dependence (LCD) and loop- 
independent dependence (LID).) 

data localization 
Optimizations designed to keep 
frequently used data in the 
processor data cache, thus 
eliminating the need for more 
costly memory accesses. 

data type A property of a data 
item that determines how its bits 
are grouped and interpreted. For 
processor instructions, the data 
type identifies the size of the 
operand and the significance of the 
bits in the operand. Some example 
data types include integer, int, 
real, and float. 

Dcache Data cache. A small 
cache memory with a one clock 
cycle access time under pipelined 
conditions. This cache holds 
prefetched and current data. On 
V2250 servers, this cache is 2 
Mbytes. 

deadlock A condition in which a 
thread waits indefinitely for some 
condition or action that cannot, or 
will not, occur. 

direct memory access (DMA) 
A method for gaining direct access 
to memory and achieving data 
transfers without involving the 
CPU. 

distributed memory A memory 
architecture used in multi-CPU 
systems, in which the system's 
memory is physically divided 
among the processors. In most 
distributed-memory architectures, 
memory is accessible from the 
single processor that owns it. 
Sharing of data requires explicit 
message passing. 

distributed part A loop 
generated by the compiler in the 
process of loop distribution. 

DMA Direct memory access. A 
method for gaining direct access to 
memory and achieving data 
transfers without involving the 
CPU. 

double A double-precision 
floating-point number that is 
stored in 64 bits in C and C++. 

doubleword A primitive data 
operand which is 8 bytes (64 bits) 
in length. Also called a longword. 
See also word. 

dummy argument In Fortran, a 
variable declared in the argument 
list of a procedure (function or 
subroutine) that receives a value 
when the procedure is called. The 
dummy argument appears in the 
source of the called procedure; the 
parameter that appears in the 
source of the calling procedure is 
an actual argument. C and C++ 
conventions refer to dummy 
arguments as formal parameters. 

dynamic selection The process 
by which the compiler chooses the 
appropriate runtime clone of a 
loop. See also clone. 

encache To copy data or 
instructions into a cache. 

exception A hardware-detected 
event that interrupts the running 
of a program, process, or system. 
See also fault. 

execution stream A series of 
instructions executed by a CPU. 

fault A type of interruption 
caused by an instruction 
requesting a legitimate action that 
cannot be carried out immediately 
due to a system problem. 

floating-point A numerical 
representation of a real number. 
On V2250 servers, a floating point 
operand has a sign (positive or 
negative) part, an exponent part, 
and a fraction part. The fraction is 
a fractional representation. The 
exponent is the value used to 
produce a power of two scale factor 
(or portion) that is subsequently 
used to multiply the fraction to 
produce an unsigned value. 

FLOPS Floating-point 
operations per second. A standard 
measure of computer processing 
power in the scientific community. 


formal parameter In C and 
C++, a variable declared in the 
parameter list of a procedure 
(function) that receives a value 
when the procedure is called. The 
formal parameter appears in the 
source of the called procedure; the 
parameter that appears in the 
source of the calling procedure is 
an actual parameter. Fortran 
conventions refer to formal 
parameters as dummy arguments. 

Fortran A high-level software 
language used mainly for scientific 
applications. 

Fortran 90 The international 
standard for Fortran adopted in 
1991. 

function A procedure whose call 
can be imbedded within another 
statement, such as an assignment 
or test. Any procedure in C or C++ 
or a procedure defined as a 
FUNCTION in Fortran. 

functional unit (FU) A part of 
a CPU that performs a set of 
operations on quantities stored in 
registers. 

gate A construct that restricts 
execution of a block of code to a 
single thread. A thread locks a 
gate on entering the gated block of 
code and unlocks the gate on 
exiting the block. When the gate is 
locked, no other threads can enter. 
Compiler directives can be used to 
automate gate constructs; gates 
can also be implemented using 
semaphores. 

Gbyte See gigabyte. 

gigabyte 1073741824 (2^30) 
bytes. 

global optimization A 
restructuring of program 
statements that is not confined to a 
single basic block. Global 
optimization, unlike 
interprocedural optimization, is 
confined to a single procedure. 
Global optimization is done by HP 
compilers at optimization level +O2 
and above. 

global register allocation 
(GRA) A method by which the 
compiler attempts to store 
commonly-referenced scalar 
variables in registers throughout 
the code in which they are most 
frequently accessed. 

global variable A variable 
whose scope is greater than a 
single procedure. In C and C++ 
programs, a global variable is a 
variable that is defined outside of 
any one procedure. Fortran has no 
global variables per se, but common 
blocks can be used to make certain 
memory locations globally 
accessible. 

granularity In the context of 
parallelism, a measure of the 
relative size of the computation 
done by a thread or parallel 
construct. Performance is 
generally an increasing function of 
the granularity. In higher-level 
language programs, possible sizes 
are routine, loop, block, statement, 
and expression. Fine granularity 
can be exhibited by parallel loops, 
tasks, and expressions. Coarse 
granularity can be exhibited by 
parallel processes. 


hand-rolled loop A loop, more 
common in Fortran than C or C++, 
that is constructed using if tests 
and goto statements rather than a 
language-provided loop structure 
such as do. 

hidden alias An alias that, 
because of the structure of a 
program or the standards of the 
language, goes undetected by the 
compiler. Hidden aliases can result 
in undetected data dependences, 
which may result in wrong 
answers. 

High Performance Fortran 
(HPF) An ad-hoc language 
extension of Fortran 90 that 
provides user-directed data 
distribution and alignment. HPF is 
not a standard, but rather a set of 
features desirable for parallel 
programming. 

hoist An optimization process 
that moves a memory load 
operation from within a loop to the 
basic block preceding the loop. 

HP Hewlett-Packard, the 
manufacturer of the PA-RISC 
chips used as processors in V2250 
servers. 

HP-UX Hewlett-Packard's Unix- 
based operating system for its 
PA-RISC workstations and 
servers. 

hypercube A topology used in 
some massively parallel processing 
systems. Each processor is 
connected to its binary neighbors. 
The number of processors in the 
system is always a power of two; 
that power is referred to as the 
dimension of the hypercube. For 
example, a 10-dimensional 
hypercube has 2^10, or 1,024 
processors. 
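
In the usual numbering of a hypercube, two processors are neighbors when their binary node numbers differ in exactly one bit. The following minimal sketch lists the neighbors of a node under that assumption; the function name hypercube_neighbors is hypothetical.

```c
/* Fill out[0..dim-1] with the neighbors of `node` in a hypercube of
 * the given dimension. Each neighbor differs from node in exactly
 * one bit. Returns the number of processors in the system (2^dim). */
unsigned hypercube_neighbors(unsigned node, unsigned dim, unsigned *out)
{
    unsigned d;

    for (d = 0; d < dim; d++)
        out[d] = node ^ (1u << d);      /* flip bit d */
    return 1u << dim;
}
```

In a 3-dimensional hypercube of 8 processors, node 5 (binary 101) has the neighbors 4, 7, and 1.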

hypernode A set of processors 
and physical memory organized as 
a symmetric multiprocessor (SMP) 
running a single image of the 
operating system. Nonscalable 
servers and V2250 servers consist 
of one hypernode. When discussing 
multidimensional parallelism or 
memory classes, hypernodes are 
generally called nodes. 

Icache Instruction cache. This 
cache holds prefetched instructions 
and permits the simultaneous 
decoding of one instruction with 
the execution of a previous 
instruction. On V2250 servers, this 
cache is 2 Mbytes. 

IEEE Institute for Electrical and 
Electronic Engineers. An 
international professional 
organization and a member of 
ANSI and ISO. 

induction variable A variable 
that changes linearly within the 
loop, that is, whose value is 
incremented by a constant amount 
on every iteration. For example, in 
the following Fortran loop, I, J, and 
K are induction variables, but L is 
not. 

DO I = 1, N 
   J = J + 2 
   K = K + N 
   L = L + I 
ENDDO 


inlining The replacement of a 
procedure (function or subroutine) 
call, within the source of a calling 
procedure, by a copy of the called 
procedure’s code. 

Institute for Electrical and 
Electronic Engineers (IEEE) 
An international professional 
organization and a member of 
ANSI and ISO. 

instruction One of the basic 
operations performed by a CPU. 

instruction cache (Icache) 
This cache holds prefetched 
instructions and permits the 
simultaneous decoding of one 
instruction with the execution of a 
previous instruction. On V2250 
servers, this cache is 2 Mbytes. 

instruction mnemonic A 
symbolic name for a machine 
instruction. 

integral division Division that 
results in a whole number solution 
with no remainder. For example, 
10 is integrally divisible by 2, but 
not by 3. 

interface A logical path 
between any two modules or 
systems. 

interleaved memory Memory 
that is divided into multiple banks 
to permit concurrent memory 
accesses. The number of separate 
memory banks is referred to as the 
memory stride. 

interprocedural 
optimization Automatic 
analysis of relationships and 
interfaces between all subroutines 
and data structures within a 
program. Traditional compilers 
analyze only the relationships 
within the procedure being 
compiled. 

interprocessor 
communication The process of 
moving or sharing data, and 
synchronizing operations between 
processors on a multiprocessor 
system. 

intrinsic A function or 
subroutine that is an inherent part 
of a computer language. For 
example, sin is a Fortran 
intrinsic. 

job scheduler That portion of 
the operating system that 
schedules and manages the 
execution of all processes. 

join The synchronized 
termination of parallel execution 
by spawned tasks or threads. 

jump Departure from normal 
one-step incrementing of the 
program counter. 

kbyte See kilobyte. 

kernel The core of the operating 
system where basic system 
facilities, such as file access and 
memory management functions, 
are performed. 

kernel thread identifier 
(ktid) A unique integer identifier 
(not necessarily sequential) 
assigned when a thread is created. 

kilobyte 1024 (2^10) bytes. 

latency The time delay between 
the issuing of an instruction and 
the completion of the operation. A 
common benchmark used for 
comparing systems is the latency 
of coherent memory access 
instructions. This particular 
latency measurement is believed to 
be a good indication of the 
scalability of a system; low latency 
equates to low system overhead as 
system size increases. 

linker A software tool that 
combines separate object code 
modules into a single object code 
module or executable program. 

load An instruction used to move 
the contents of a memory location 
into a register. 

locality of reference An 
attribute of a memory reference 
pattern that refers to the 
likelihood of an address of a 
memory reference being physically 
close to the CPU making the 
reference. 

local optimization 
Restructuring of program 
statements within the scope of a 
basic block. Local optimization is 
done by HP compilers at 
optimization level +O1 and above. 

localization Data localization. 
Optimizations designed to keep 
frequently used data in the 
processor data cache, thus 
eliminating the need for more
costly memory accesses. 

logical address The address as
seen by the application program.

logical memory Virtual 
memory. The memory space as 
seen by the program, which may be 
larger than the available physical 
memory. The virtual memory of a
V2250 server can be up to 16
Tbytes. HP-UX can map this
virtual memory to a smaller set of
physical memory, using disk space
to make up the difference if
necessary. Also called
virtual memory.

longword (I) Doubleword. A 
primitive data operand which is 8 
bytes (64 bits) in length. See also 
word. 

loop blocking A loop 
transformation that strip mines 
and interchanges a loop to provide 
optimal reuse of the encachable 
loop data. 
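
As a rough illustration of loop blocking, the sketch below transposes a square matrix one tile at a time. The function name transpose_blocked and the tile size BLOCK are assumptions made for this example; they do not come from this guide or from any HP compiler option.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 4  /* hypothetical tile edge, chosen only for illustration */

/* Transpose the n-by-n row-major matrix src into dst, tile by tile.
   The two outer loops step from tile to tile (the strip-mined loops);
   the two inner loops stay inside one tile, so the data touched there
   can be reused while it is still in cache. */
void transpose_blocked(const double *src, double *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

A compiler that performs loop blocking would derive an equivalent loop nest from the plain two-loop version; the hand-written form is shown only to make the transformation visible.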

loop-carried dependence 
(LCD) A dependence between 
two operations executed on 
different iterations of a given loop 
and on the same iteration of all
enclosing loops. A loop carries a 
dependence from an indexed 
assignment to an indexed use if, 
for some iteration of the loop, the 
assignment stores into an address 
that is referred to on a different
iteration of the loop. 
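
The difference between a loop that carries a dependence and one that does not can be sketched in C as follows. Both function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-carried dependence: iteration i reads a[i - 1], which was
   stored by iteration i - 1, so the iterations must run in order. */
void prefix_sum(double *a, size_t n)
{
    assert(a != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}

/* No loop-carried dependence: each iteration reads and writes only
   a[i], so the iterations could run in any order or in parallel. */
void scale(double *a, size_t n, double factor)
{
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * factor;
}
```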

loop constant A constant or 
expression whose value does not 
change within a loop. 

loop distribution The
restructuring of a loop nest to
create simple loop nests. Loop
distribution creates two or more
loops, called distributed parts,
which can serve to make
parallelization more efficient by
increasing the opportunities for
loop interchange and isolating code
that must run serially from
parallelizable code. It can also
improve data localization and
other optimizations.
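
A minimal sketch of loop distribution, assuming that the arrays a, b, and c do not overlap. The function names are hypothetical; the second function is what the transformation would produce from the first.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: an independent update of a[] shares the loop body
   with a recurrence on b[], so the whole loop has to run serially. */
void update_fused(double *a, double *b, const double *c, size_t n)
{
    assert(a != b);
    for (size_t i = 1; i < n; i++) {
        a[i] = 2.0 * c[i];
        b[i] = b[i - 1] + c[i];
    }
}

/* Distributed parts: the first loop has no loop-carried dependence
   and is a candidate for parallelization; the second loop keeps the
   serial recurrence on its own. */
void update_distributed(double *a, double *b, const double *c, size_t n)
{
    assert(a != b);
    for (size_t i = 1; i < n; i++)
        a[i] = 2.0 * c[i];
    for (size_t i = 1; i < n; i++)
        b[i] = b[i - 1] + c[i];
}
```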

loop-independent dependence 
(LID) A dependence between two 
operations executed on the same 
iteration of all enclosing loops such 
that one operation must precede 
the other to produce correct 
results. 

loop induction variable See 

induction variable 

loop interchange The 

reordering of nested loops. Loop 
interchange is generally done to 
increase the granularity of the 
parallelizable loop(s) present or to 
allow more efficient access to loop 
data. 
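
The effect of loop interchange on access order can be sketched in C, where arrays are stored in row-major order. The array dimensions and function names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 3
#define COLS 4

/* Before interchange: the inner loop walks down a column, so
   consecutive accesses are COLS elements apart in memory. */
double sum_column_inner(double m[ROWS][COLS])
{
    assert(m != NULL);
    double total = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            total += m[i][j];
    return total;
}

/* After interchange: the inner loop walks along a row, so
   consecutive accesses are adjacent in memory. */
double sum_row_inner(double m[ROWS][COLS])
{
    assert(m != NULL);
    double total = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            total += m[i][j];
    return total;
}
```

Both functions add the same elements; only the order of the memory accesses changes. In Fortran, which stores arrays in column-major order, the preferred loop order is the opposite one.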

loop invariant Loop constant. A 
constant or expression whose value 
does not change within a loop. 

loop invariant computation 

An operation that yields the same 
result on every iteration of a loop. 
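
As an illustration, the expression x * y + 1.0 below yields the same result on every iteration, so it can be computed once before the loop. The function names are hypothetical, and an optimizing compiler would normally perform this hoisting without any source change.

```c
#include <assert.h>
#include <stddef.h>

/* As written: the loop-invariant expression appears inside the loop. */
void scale_naive(double *a, size_t n, double x, double y)
{
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * (x * y + 1.0);
}

/* After hoisting: the invariant is computed once, outside the loop. */
void scale_hoisted(double *a, size_t n, double x, double y)
{
    assert(a != NULL || n == 0);
    const double k = x * y + 1.0;
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * k;
}
```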

loop replication The process of 
transforming one loop into more 
than one loop to facilitate an 
optimization. The optimizations 
that replicate loops are if-do and 
if-for optimizations, dynamic 
selection, loop unrolling, and loop 
blocking. 

machine exception A fatal 
error in the system that cannot be 
handled by the operating system. 
See also exception. 

main memory Physical memory 
other than what the processor 
caches. 
main procedure A procedure
invoked by the operating system 
when an application program 
starts up. The main procedure is 
the main program in Fortran; in C 
and C++, it is the function main(). 

main program In a Fortran
program, the program section 
invoked by the operating system 
when the program starts up. 

Mbyte See megabyte (Mbyte).

megabyte (Mbyte) 1048576 
(2^20) bytes.

megaflops (MFLOPS) One 

million floating-point operations 
per second. 

memory bank conflict An 

attempt to access a particular 
memory bank before a previous 
access to the bank is complete, or 
when the bank is not yet finished 
recycling (i.e., refreshing). 

memory management The 

hardware and software that 
control memory page mapping and 
memory protection. 

message Data copied from one 
process to another (or the same) 
process. The copy is initiated by 
the sending process, which 
specifies the receiving process. The 
sending and receiving processes 
need not share a common address 
space. (Note: depending on the 
context, a process may be a 
thread.) 

Message-Passing Interface 
(MPI) A message-passing and 
process control library. For 
information on the Hewlett-
Packard implementation of MPI,
refer to the HP MPI User's Guide
(B6011-90001).

message passing A type of
programming in which program 
modules (often running on 
different processors or different 
hosts) communicate with each 
other by means of system library 
calls that package, transmit, and 
receive data. All message-passing 
library calls must be explicitly 
coded by the programmer. 

MIMD (multiple instruction 
stream multiple data stream) 

A computer architecture that uses 
multiple processors, each 
processing its own set of
instructions simultaneously and
independently of others. MIMD
also describes when processes are
performing different operations on
different data. Compare
with SIMD.

multiprocessing The creation 
and scheduling of processes on any 
subset of CPUs in a system 
configuration. 

mutex A variable used to 
construct an area (region of code) 
of mutual exclusion. When a mutex 
is locked, entry to the area is 
prohibited; when the mutex is free, 
entry is allowed. 
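
A minimal sketch of a mutex guarding a shared counter, using the POSIX threads calls pthread_mutex_lock() and pthread_mutex_unlock(). The names counter, counter_lock, and counter_add are assumptions made for this example; the sketch does not create any threads itself.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;   /* shared data protected by counter_lock */

/* Add delta to the shared counter and return the new value.
   The code between lock and unlock is the area of mutual exclusion:
   a second caller blocks in pthread_mutex_lock() until the first
   caller has unlocked the mutex. */
long counter_add(long delta)
{
    int rc = pthread_mutex_lock(&counter_lock);
    assert(rc == 0);
    counter += delta;
    long snapshot = counter;
    rc = pthread_mutex_unlock(&counter_lock);
    assert(rc == 0);
    return snapshot;
}
```

Any number of threads can call counter_add() concurrently; the updates are serialized by the mutex, at the cost of some waiting when threads contend for it.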

mutual exclusion A protocol 
that prevents access to a given 
resource by more than one thread 
at a time.

negate An instruction that 
changes the sign of a number. 
network A system of
interconnected computers that 
enables machines and their users 
to exchange information and share 
resources. 

node On HP scalable and
nonscalable servers, a node is
equivalent to a hypernode. The
term "node" is generally used in
place of hypernode.

non-uniform memory access 
(NUMA) This term describes 
memory access times in systems in
which accessing different types of
memory (for example, memory
local to the current hypernode or
memory remote to the current
hypernode) results in non-uniform
access times.

nonblocking crossbar A 

switching device that connects the 
CPUs, banks of memory, and I/O 
controller on a single hypernode. 
Because the crossbar is 
nonblocking, all ports can run at 
full bandwidth simultaneously 
provided there is not contention for 
a particular port. 

NUMA Non-uniform memory 
access. This term describes 
memory access times in systems in
which accessing different types of
memory (for example, memory
local to the current hypernode or
memory remote to the current
hypernode) results in non-uniform
access times.

offset In the context of a process
address space, an integer value 
that is added to a base address to 
calculate a memory address. 
Offsets in V2250 servers are 64-bit 


values, and must keep address 
values within a single 16-Tbyte 
memory space. 

opcode A predefined sequence of
bits in an instruction that specifies 
the operation to be performed. 

operating system The program 
that manages the resources of a 
computer system. V2250 servers 
use the HP-UX operating system.

optimization The refining of 
application software programs to 
minimize processing time. 
Optimization takes maximum 
advantage of a computer's 
hardware features and minimizes 
idle processor time. 

optimization level The degree 
to which source code is optimized 
by the compiler. The HP compilers 
offer five levels of optimization: 
+O0, +O1, +O2, +O3, and +O4.
The +O4 option is not available in
Fortran 90. 

oversubscript An array 
reference that falls outside 
declared bounds. 

oversubscription In the
context of parallel threads, a
process attribute that permits the
creation of more threads within a
process than the number of
processors available to the process.

PA-RISC The Hewlett-Packard 
Precision Architecture reduced 
instruction set. 

packet A group of related items. 
A packet may refer to the 
arguments of a subroutine or to a
group of bytes that is transmitted 
over a network. 
page A page is the unit of virtual
or physical memory controlled by
the memory management
hardware and software. On HP-UX
servers, the default page size is 4K
(4,096) contiguous bytes. Valid
page sizes are: 4K, 16K, 64K, 256K,
1 Mbyte, 4 Mbytes, 16 Mbytes,
64 Mbytes, and 256 Mbytes.

See also virtual memory. 
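
For the default 4K page size, splitting an address into a page number and an offset within the page is simple integer arithmetic. The sketch below shows it; the macro PAGE_BYTES and the function names are hypothetical and are not HP-UX interfaces.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u   /* default page size given in the definition above */

/* Index of the page that contains addr. */
uint64_t page_number(uint64_t addr)
{
    return addr / PAGE_BYTES;
}

/* Byte position of addr within its page. */
uint64_t page_offset(uint64_t addr)
{
    return addr % PAGE_BYTES;
}

/* Rebuild an address from a page number and an in-page offset. */
uint64_t make_address(uint64_t page, uint64_t offset)
{
    assert(offset < PAGE_BYTES);
    return page * PAGE_BYTES + offset;
}
```

For example, address 8200 lies 8 bytes into page 2, because 8200 = 2 * 4096 + 8.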

page fault A page fault occurs
when a process requests data that 
is not currently in memory. This 
requires the operating system to 
retrieve the page containing the 
requested data from disk. 

page frame A page frame is the 
unit of physical memory in which 
pages are placed. Referenced and 
modified bits associated with each 
page frame aid in memory 
management. 

parallel optimization The 

transformation of source code into 
parallel code (parallelization) and 
restructuring of code to enhance 
parallel performance. 

parallelization The process of 
transforming serial code to a form 
of code that can run 
simultaneously on multiple CPUs
while preserving semantics. When
+O3 +Oparallel is specified, the
HP compilers automatically 
parallelize loops in your program 
and recognize compiler directives 
and pragmas with which you can 
manually specify parallelization of 
loops, tasks, and regions. 

parallelization, loop The 

process of splitting a loop into 
several smaller loops, each of 


which operates on a subset of the 
data of the original loop, and 
generating code to run these loops 
on separate processors in parallel. 

parallelization, ordered The 

process of splitting a loop into 
several smaller loops, each of 
which iterates over a subset of the 
original data with a stride equal to 
the number of loops created, and 
generating code to run these loops 
on separate processors. Each 
iteration in an ordered parallel 
loop begins execution in the 
original iteration order, allowing 
dependences within the loop to be 
synchronized to yield correct 
results via gate constructs. 

parallelization, stride-based 

The process of splitting up a loop 
into several smaller loops, each of 
which iterates over several 
discontiguous chunks of data, and 
generating code to run these loops 
on separate processors in parallel. 
Stride-based parallelism can only 
be achieved manually by using 
compiler directives. 

parallelization, strip-based 

The process of splitting up a loop 
into several smaller loops, each of 
which iterates over a single 
contiguous subset of the data of the 
original loop, and generating code 
to run these loops on separate 
processors in parallel. Strip-based 
parallelism is the default for 
automatic parallelism and for 
directive-initiated loop parallelism 
in absence of the chunk_size = n 
or ordered attributes. 
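
The difference between strip-based and stride-based division of iterations can be sketched with two index calculations. This is an illustration only: the function names are hypothetical, and the rounding used here (ceiling division for the strip length, round-robin assignment of chunks) is an assumption, not a description of how the HP runtime actually assigns iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Strip-based: thread tid receives one contiguous range [*lo, *hi)
   of the n iterations. */
void strip_range(size_t n, size_t nthreads, size_t tid,
                 size_t *lo, size_t *hi)
{
    assert(nthreads > 0 && tid < nthreads);
    size_t chunk = (n + nthreads - 1) / nthreads;   /* ceiling division */
    *lo = tid * chunk < n ? tid * chunk : n;
    *hi = *lo + chunk < n ? *lo + chunk : n;
}

/* Stride-based: iterations are grouped into chunks of chunk_size and
   the chunks are dealt out to the threads in turn, so each thread
   works on several discontiguous chunks. Returns the owning thread. */
size_t stride_owner(size_t i, size_t chunk_size, size_t nthreads)
{
    assert(chunk_size > 0 && nthreads > 0);
    return (i / chunk_size) % nthreads;
}
```

With 10 iterations and 4 threads, strip_range() gives the ranges [0,3), [3,6), [6,9), and [9,10). With a chunk size of 2 and 3 threads, stride_owner() assigns iterations 0-1 and 6-7 to thread 0.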
parallelization, task The

process of splitting up source code 
into independent sections which 
can safely be run in parallel on 
available processors. HP 
programming languages provide 
compiler directives and pragmas 
that allow you to identify parallel 
tasks in source code. 

parameter In C and C++, either
a variable declared in the 
parameter list of a procedure 
(function) that receives a value 
when the procedure is called 
(formal parameter) or the variable 
or constant that is passed by a call 
to a procedure (actual parameter). 
In Fortran, a symbolic name for a 
constant. 

path An environment variable 
that you set within your shell that 
allows you to access commands in 
various directories without having 
to specify a complete path name. 

physical address A unique 
identifier that selects a particular 
location in the computer's memory. 
Because HP-UX supports virtual 
memory, programs address data by 
its virtual address; HP-UX then 
maps this address to the 
appropriate physical address. See 
also virtual address. 

physical address space The 

set of possible addresses for a
particular physical memory. 

physical memory Computer 
hardware that stores data. V2250 
servers can contain up to 16 
Gbytes of physical memory on a 
16-processor hypernode. 


pipeline An overlapping 
operating cycle function that is 
used to increase the speed of
computers. Pipelining provides a 
means by which multiple 
operations occur concurrently by 
beginning one instruction 
sequence before another has 
completed. Maximum efficiency is 
achieved when the pipeline is 
"full," that is, when all stages are 
operating on separate instructions. 

pipelining Issuing instructions 
in an order that best uses the 
pipeline. 

procedure A unit of program 
code. In Fortran, a function,
subroutine, or main program; in C 
and C++, a function. 

process A collection of one or
more execution streams within a 
single logical address space; an 
executable program. A process is 
made up of one or more threads. 

process memory The portion of 
system memory that is used by an 
executing process. 

programming model A 

description of the features 
available to efficiently program a
certain computer architecture. 

program unit A procedure or 
main section of a program. 

queue A data structure in which 
entries are made at one end and 
deletions at the other. Often 
referred to as first-in, first-out 
(FIFO). 

rank The number of dimensions 
of an array. 
read A memory operation in
which the contents of a memory 
location are copied and passed to 
another part of the system. 

recurrence A cycle of 
dependences among the operations 
within a loop in which an operation 
in one iteration depends on the 
result of a following operation that 
executes in a previous iteration. 

recursion An operation that is 
defined, at least in part, by a 
repeated application of itself. 

recursive call A condition in 
which the sequence of instructions 
in a procedure causes the 
procedure itself to be invoked 
again. Such a procedure must be 
compiled for reentrancy. 

reduced instruction set 
computer (RISC) An 

architectural concept that applies 
to the definition of the instruction 
set of a processor. A RISC
instruction set is an orthogonal
instruction set that is easy to
decode in hardware and for which
a compiler can generate highly
optimized code. The PA-RISC
processor used in V2250 servers
employs a RISC architecture.

reduction An arithmetic 
operation that performs a 
transformation on an array to 
produce a scalar result. 
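
A sum is the most common reduction. The sketch below also shows why reductions parallelize well: each strip of the array can be added into a private partial sum, and the partial sums are combined at the end. The function name sum_in_parts and the parts parameter are assumptions made for this example, and the strips are processed one after another here rather than on separate processors.

```c
#include <assert.h>
#include <stddef.h>

/* Reduce the array a[0..n-1] to a single sum, working in `parts`
   contiguous strips. Each strip accumulates into its own private
   variable; the combine step adds the partial sums together. */
double sum_in_parts(const double *a, size_t n, size_t parts)
{
    assert(parts > 0);
    double total = 0.0;
    size_t chunk = (n + parts - 1) / parts;      /* strip length */
    for (size_t p = 0; p < parts; p++) {
        double partial = 0.0;                    /* private accumulator */
        size_t lo = p * chunk;
        size_t hi = lo + chunk < n ? lo + chunk : n;
        for (size_t i = lo; i < hi; i++)
            partial += a[i];
        total += partial;                        /* combine step */
    }
    return total;
}
```

Because floating-point addition is not associative, summing in strips can give a slightly different result from a single left-to-right sum when the values are not exactly representable.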

reentrancy The ability of a 
program unit to be executed by 
multiple threads at the same time.
Each invocation maintains a
private copy of its local data and a
private stack to store compiler-
generated temporary variables.
Procedures must be compiled for
reentrancy in order to be invoked
in parallel or to be used for
recursive calls. HP compilers
compile for reentrancy by default.

reference Any operation that 
requires a cache line to be
encached; this includes load as 
well as store operations, because 
writing to any element in a cache 
line requires the entire cache line 
to be encached. 

register A hardware entity that 
contains an address, operand, or 
instruction status information. 

reuse, data In the context of a
loop, the ability to use data fetched
for one loop operation in another
operation. In the context of a
cache, reusing data that was 
encached for a previous operation; 
because data is fetched as part of a 
cache line, if any of the other items 
in the cache line are used before 
the line is flushed to memory, 
reuse has occurred. 

reuse, spatial Reusing data 
that resides in the cache as a
result of the fetching of another 
piece of data from memory. 
Typically, this involves using array 
elements that are contiguous to 
(and therefore part of the cache 
line of) an element that has 
already been used, and therefore is
already encached. 

reuse, temporal Reusing a data
item that has been used previously. 

RISC Reduced instruction set 
computer. An architectural concept 
that applies to the definition of the 
instruction set of a processor. A 
RISC instruction set is an
orthogonal instruction set that is
easy to decode in hardware and for
which a compiler can generate
highly optimized code. The
PA-RISC processor used in V2250
servers employs a RISC
architecture. 

rounding A method of obtaining 
a representation of a number that 
has less precision than the original
in which the closest number 
representable under the lower 
precision system is used. 

row-major order Memory 
representation of an array such 
that the rows of an array are 
stored contiguously. For example, 
given a two-dimensional array 
a[3][4], array element a[0][3]
immediately precedes a[1][0] in
memory. This is the default storage
method for arrays in C. 
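
The row-major rule can be written as a one-line offset calculation. The helper name row_major_offset is hypothetical and is used only to illustrate the layout.

```c
#include <assert.h>
#include <stddef.h>

/* Offset, in elements, of element [i][j] from the start of a
   row-major array that has ncols columns per row. */
size_t row_major_offset(size_t i, size_t j, size_t ncols)
{
    assert(j < ncols);
    return i * ncols + j;
}
```

For the 3-by-4 array in the definition above, a[0][3] is at offset 3 and a[1][0] is at offset 4, so the last element of one row is immediately followed by the first element of the next.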

scope The domain in which a 
variable is visible in source code. 
The rules that determine scope are
different for Fortran and C/C++. 

semaphore An integer variable 
assigned one of two values: one 
value to indicate that it is "locked," 
and another to indicate that it is 
"free." Semaphores can be used to 
synchronize parallel threads. 
Pthreads provides a set of 
manipulation functions to 
facilitate this. 

shape The number of elements 
in each dimension of an array. 

shared virtual memory A 

memory architecture in which 
memory can be accessed by all 


processors in the system. This 
architecture can also support 
virtual memory. 

shell An interactive command 
interpreter that is the interface 
between the user and the UNIX
operating system. 

SIMD (single instruction
stream multiple data stream)

A computer architecture that
performs one operation on multiple
sets of data. A processor (separate
from the SMP array) is used for
the control logic, and the
processors in the SMP array
perform the instruction on the
data. Compare with MIMD
(multiple instruction stream
multiple data stream).

single A single-precision 
floating-point number stored in 32 
bits. See also double. 

SMP Symmetric multiprocessor. 
A multiprocessor computer in 
which all the processors have 
equal access to all machine 
resources. Symmetric 
multiprocessors have no manager 
or worker processors; the operating 
system runs on any or all of the 
processors. 

socket An endpoint used for 
interprocess communication. 

socket pair Bidirectional pipes 
that enable application programs 
to set up two-way communication 
between processes that share a 
common ancestor. 
source code The uncompiled
version of a program, written in a 
high-level language such as 
Fortran or C. 

source file A file that contains
program source code. 

space A contiguous range of 
virtual addresses within the 
system-wide virtual address space. 
Spaces are 16 Tbytes in the V2250 
servers. 

spatial reference An attribute 
of a memory reference pattern that 
pertains to the likelihood of a
subsequent memory reference 
address being numerically close to 
a previously referenced address. 

spawn To activate existing 
threads. 

spawn context A parallel loop, 
task list, or region that initiates 
the spawning of threads and 
defines the structure within which 
the threads' spawn thread IDs are
valid. 

spawn thread identifier 
(stid) A sequential integer 
identifier associated with a 
particular thread that has been 
spawned. stids are only assigned to
spawned threads, and they are
assigned within a spawn context;
therefore, duplicate stids may be
present amongst the threads of a
program, but stids are always
unique within the scope of their
spawn context. stids are assigned
sequentially and run from 0 to one 
less than the number of threads 
spawned in a particular spawn 
context. 


SPMD Single program multiple 
data. A single program executing 
simultaneously on several 
processors. This is usually taken to 
mean that there is redundant 
execution of sequential scalar code 
on all processors. 

stack A data structure in which 
the last item entered is the first to 
be removed. Also referred to as 
last-in, first-out (LIFO). HP-UX 
provides every thread with a stack 
which is used to pass arguments to 
functions and subroutines and for 
local variable storage. 

store An instruction used to 
move the contents of a register to 
memory. 

strip length, parallel In strip-
based parallelism, the amount by
which the induction variable of a 
parallel inner loop is advanced on 
each iteration of the (conceptual) 
controlling outer loop. 

strip mining The 

transformation of a single loop into 
two nested loops. Conceptually, 
this is how parallel loops are 
created by default. A conceptual 
outer loop advances the initial 
value of the inner loop's induction 
variable by the parallel strip 
length. The parallel strip length is 
based on the trip count of the loop 
and the amount of code in the loop 
body. Strip mining is also used by 
the data localization optimization. 
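
A minimal sketch of strip mining applied to a simple scaling loop. The function name and the strip parameter are assumptions made for this example; in practice the compiler chooses the strip length.

```c
#include <assert.h>
#include <stddef.h>

/* Strip-mined form of:  for (i = 0; i < n; i++) a[i] *= f;
   The outer loop advances the starting index by the strip length;
   the inner loop processes one strip of at most `strip` elements. */
void scale_strip_mined(double *a, size_t n, size_t strip, double f)
{
    assert(strip > 0);
    for (size_t start = 0; start < n; start += strip) {
        size_t end = start + strip < n ? start + strip : n;
        for (size_t i = start; i < end; i++)
            a[i] *= f;
    }
}
```

When the loop is parallelized, each strip of the inner loop is the unit of work handed to a thread; when it is used for data localization, the strip is sized so that the data it touches fits in cache.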

subroutine A software module 
that can be invoked from anywhere
in a program. 
superscalar A class of RISC
processors that allow multiple 
instructions to be issued in each 
clock period. 

Symmetric Multiprocessor 
(SMP) A multiprocessor 
computer in which all the 
processors have equal access to all 
machine resources. Symmetric 
multiprocessors have no manager 
or worker processors; the operating
system runs on any or all of the 
processors. 

synchronization A method of 
coordinating the actions of 
multiple threads so that operations
occur in the right sequence. When 
manually optimizing code, you can 
synchronize programs using 
compiler directives, calls to library 
routines, or assembly-language 
instructions. You do so, however, at 
the cost of additional overhead; 
synchronization may cause at least 
one CPU to wait for another. 

system administrator 
(sysadmin) The person 
responsible for managing the 
administration of a system. 

system manager The person 
responsible for the management 
and operation of a computer 
system. Also called the system
administrator and the sysadmin. 

Tbyte See terabyte (Tbyte). 

terabyte (Tbyte) 

1099511627776 (2^40) bytes.

term A constant or symbolic 
name that is part of an expression. 


thread An independent 
execution stream that is executed 
by a CPU. One or more threads, 
each of which can execute on a 
different CPU, make up each 
process. Memory, files, signals, and 
other process attributes are 
generally shared among threads in 
a given process, enabling the 
threads to cooperate in solving the
common problem. Threads are 
created and terminated by 
instructions that can be 
automatically generated by HP 
compilers, inserted by adding 
compiler directives to source code, 
or coded explicitly using library 
calls or assembly-language. 

thread create To activate 
existing threads. 

thread identifier An integer 
identifier associated with a 
particular thread. See thread 
identifier, kernel (ktid) and thread 
identifier, spawn (stid). 

thread identifier, kernel 
(ktid) A unique integer identifier
(not necessarily sequential) 
assigned when a thread is created. 

thread identifier, spawn 
(stid) A sequential integer 
identifier associated with a 
particular thread that has been 
spawned. stids are only assigned to
spawned threads, and they are
assigned within a spawn context;
therefore, duplicate stids may be
present amongst the threads of a
program, but stids are always
unique within the scope of their
spawn context. stids are assigned
sequentially and run from 0 to one
less than the number of threads
spawned in a particular spawn 
context. 

thread-private memory Data 
that is accessible by a single 
thread only (not shared among the 
threads constituting a process). 

translation lookaside buffer 

A hardware entity that contains 
information necessary to translate 
a virtual memory reference to the 
corresponding physical page and to 
validate memory accesses. 

TLB See translation lookaside 
buffer. 

trip count The number of 
iterations a loop executes. 

unsigned A value that is never
negative.

user interface The portion of a 
computer program that processes 
input entered by a human and 
provides output for human users. 

utility A software tool designed 
to perform a frequently used 
support function. 

vector An ordered list of items 
in a computer's memory, contained 
within an array. A simple vector is 
defined as having a starting 
address, a length, and a stride. An 
indirect address vector is defined 
as having a relative base address 
and a vector of values to be applied 
as offsets to the base. 

vector processor A processor 
whose instruction set includes 
instructions that perform 


operations on a vector of data (such 
as a row or column of an array) in 
an optimized fashion. 

virtual address The address by 
which programs access their data. 
HP-UX maps this address to the 
appropriate physical memory 
address. See also space. 

virtual aliases Two different 
virtual addresses that map to the 
same physical memory address. 

virtual machine A collection of 
computing resources configured so 
that a user or process can access 
any of the resources, regardless of 
their physical location or operating 
system, from a single interface. 

virtual memory The memory 
space as seen by the program, 
which is typically larger than the 
available physical memory. The 
virtual memory of a V2250 server 
can be up to 16 Tbytes. The 
operating system maps this virtual 
memory to a smaller set of physical 
memory, using disk space to make 
up the difference if necessary. Also 
called logical memory. 

wall-clock time The 

chronological time an application 
requires to complete its processing. 
If an application starts running at 
1:00 p.m. and finishes at 5:00 a.m. 
the following morning, its wall- 
clock time is sixteen hours. 
Compare with CPU time. 

word A contiguous group of 
bytes that make up a primitive 
data operand and start on an 
addressable boundary. In V2250
servers a word is four
bytes (32 bits) in length. See also
doubleword.

workstation A stand-alone 
computer that has its own 
processor, memory, and possibly a 
disk drive and can typically sit on 
a user's desk. 

write A memory operation in 
which a memory location is 
updated with new data. 

zero In floating-point number
representations, zero is 
represented by the sign bit with a 
value of zero and the exponent 
with a value of zero. 
Index


Symbols 

& operator, 32
+DA, 142 

+DAarchitecture, 141 
+DS, 142 
+DSmodel, 141 
+O[no]aggressive, 114 
+O[no]all, 114, 118 
+O[no]autopar, 114, 118 
+O[no]conservative, 114, 119 
+O[no]dataprefetch, 114, 119 
+O[no]dynsel, 114, 120, 149 
+O[no]entrysched, 117, 120 
+O[no]fail_safe, 114, 121 
+O[no]fastaccess, 114, 121
+O[no]fltacc, 114, 117, 121
+O[no]global_ptrs_unique, 114, 122, 143
+O[no]info, 114, 123, 151
+O[no]initcheck, 115, 117, 123 
+O[no]inline, 55, 57, 91, 92, 112, 115, 124 
+O[no]libcalls, 115, 117, 125 
+O[no]limit, 58, 115, 118, 126 
+O[no]loop_block, 58, 70, 115, 127, 148 
+O[no]loop_transform, 58, 70, 79, 82, 84, 89, 115,
127, 148
+O[no]loop_unroll, 58, 127
+O[no]loop_unroll_jam, 84, 115, 128, 150
+O[no]moveflops, 115, 128 
+O[no]multiprocessor, 115, 129 
+O[no]parallel, 94, 115, 149, 160 
+O[no]parmsoverlap, 115, 130 
+O[no]pipeline, 49, 115, 130 
+O[no]procelim, 115, 131 
+O[no]ptrs_ansi, 115, 131, 143, 275 
+O[no]ptrs_strongly_typed, 115, 132, 275 
+O[no]ptrs_to_globals, 115, 135, 143
+O[no]regreassoc, 115, 136
+O[no]report, 115, 137, 152, 160
+O[no]sharedgra, 115, 138
+O[no]signedpointers, 116, 117, 138
+O[no]size, 58, 116, 138
+O[no]static_prediction, 116, 139
+O[no]vectorize, 116, 117, 139
+O[no]volatile, 116, 140
+O[no]whole_program_mode, 116, 140
+O0 optimization, 27
+O1 optimization, 27
+O2 optimization, 28, 40, 58
+O3, 111
+O3 optimization, 28, 55, 57, 58, 70, 77, 79, 82,
84, 89
+O4, 111
+O4 optimization, 55, 57
+Oinline_budget, 55, 92, 115, 125
+Onoinitcheck, 31
+Oparallel, 111
+pd, 23
+pi, 23
+tmtarget, 141
[mc]_fetch_and_add32(), 338 
[mc]_fetch_and_clear32(), 338 
[mc]_fetch_and_dec32(), 338 
[mc]_fetch_and_inc32(), 338 
[mc]_fetch_and_set32(), 338
[mc]_fetch32(), 338 
[mc]_init32(), 337 

A 

aC++ compiler
location of, 25
register allocation, 44
aC++, parallelism in, 111
accessing pthreads, 309, 310 
accumulator variables, 289 
actual registers, 40 
address space, virtual, 17 
address-exposed array variables, 144 
addressing, 41 

advanced scalar optimizations, 7 
aggressive optimizations, 118 
algorithm, type-safe, 274 
aliases, 12 
hidden, 276 
potential, 275 
aliasing, 59, 64, 69, 274 
algorithm, 274
examples, 64, 65 
mode, 275 
rules, 132 
stop variables, 277 
aliasing rules, type-inferred, 143 
alignment 
data, 27, 37 
of arrays, 282 
simple, 27 

alloc_barrier functions, 247 
alloc_gate functions, 247
alloca(), 125 

ALLOCATE statement, 12, 282 
allocating 
barriers, 245 
gates, 245 
shared memory, 138 
storage, 204 
allocation functions, 247 
alternate name for object, 64 
Analysis Table, 154, 158 
analysis, flow-sensitive, 277 
ANSI C, 274 
aliasing algorithm, 274 
ANSI standard rules, 273 
architecture 
SMP, 1, 2 

architecture optimizations, 141 
arguments 
block_factor, 71 
dummy, 246 

arithmetic expressions, 31, 43, 49, 51, 136 
array, 33 

address computations, 136 
address-exposed, 144 
bounds of, 31 
data, fetch, 71 
dimensions, 204 
indexes, 59 
references, 32 
subscript, 106 
arrays 


access order, 82 
alignment of, 282 
dummy arguments, 286 
equivalencing, 12 
global, 282 

LOOP_PRIVATE, 225 
of type specifier, 237 
store, 64 
strips of, 70 
unaligned, 286 
asin math function, 126 
assertion, linker disables, 141 
asymmetric parallelism, 329 
asynchronous interrupts, 120 
atan math function, 126 
atan2 math function, 126 
attributes 

LOOP_PARALLEL, 181 
PREFER_PARALLEL, 181
volatile, 33 

automatic parallelism, 94 
avoid loop interchange, 63 

B 

barrier variable declaration, 245 
barriers, 245, 332 
allocating, 245 
deallocating, 247 
equivalencing, 246 
high-level, 313 
wait, 249 
basic blocks, 6 

BEGIN_TASKS directive and pragma, 94, 177,
192

block factor, 76

BLOCK_LOOP directive and pragma, 70, 76, 146,
148

blocking, loop, 70 
bold monospace, xvii 
brackets, xvii
curly, xvii 
branch 
destination, 67, 68
dynamic prediction, 139 
optimization, 27 
static prediction, 139 
branches 

conditional, 39, 139 
instruction, 41 
transforming, 39 
unconditional, 39 


C aliasing options, 113 
C compiler 
location of, 25 
register allocation, 44 
-C compiler option, 291 
cache 

contiguous, 18 
data, 12 
line, 12 

line boundaries, 283 
line size, 71 
lines, fetch, 73 
lines, fixed ownership, 299 
padding, 15 
semaphores, 335 
thrashing, 13, 78, 279, 298 
cache line boundaries 
force arrays on (C), 282 
force arrays on (Fortran), 282 
cache-coherency, 12 
cache-line, 18 
calls 

cloned, 154, 155 
inlined, 154, 155 
char, 32 
chatr utility, 23 
check subscripts, 291 
child threads, 317 
CHUNK_SIZE, 283 
class, 237 

memory, 233, 235, 236, 237, 238 


cloned 

calls, 154, 155 
procedures, delete, 140 
cloning, 57, 102, 112 
across files, 57 
across multi pie files, 112 
at +04, 91 
within files, 57 
within one source file, 57 
Code, 257 
code 

contiguous, 197 
dead, 27 
entry, 40 
examining, 302 
exit, 40 

isolate in loop, 259 
loop-invariant, 45 
motion, 136, 250, 274 
parallelizing outside loop, 192 
scalar, 197 
size, 124 

synchronizing, 257 
transformation, 34 
coding 

guidelines, 31, 32 
standards, 91 
command syntax, xviii 
command-line options, 55, 115 
+O[no]_block_loop, 70 
+O[no]_loop_transform, 89 
+O[no]aggressive, 114, 117 
+O[no]all, 114, 118 
+O[no]autopar, 114, 118 
+O[no]conservative, 114, 119 
+O[no]dataprefetch, 114, 119 
+O[no]dynsel, 114, 120 
+O[no]entrysched, 114, 120 
+O[no]fail_safe, 114, 121 
+O[no]fastaccess, 114, 121
+O[no]fltacc, 114, 121 
+O[no]global_ptrs, 143 
+O[no]global_ptrs_unique, 114, 122 
+O[no]info, 114, 123
+O[no]initcheck, 123
+O[no]inline, 55, 91, 92, 115, 124
+O[no]libcalls, 115, 125
+O[no]limit, 45, 115, 126
+O[no]loop_block, 115, 127
+O[no]loop_transform, 58, 70, 79, 89, 115, 127
+O[no]loop_unroll, 58, 127
+O[no]loop_unroll, 115
+O[no]loop_unroll_jam, 58, 115, 128
+O[no]moveflops, 115, 128
+O[no]multiprocessor, 115, 129
+O[no]parallel, 94, 115, 129
+O[no]parmsoverlap, 115, 130
+O[no]pipeline, 49, 115, 130
+O[no]procelim, 115, 131
+O[no]ptrs_ansi, 115, 131, 143, 275
+O[no]ptrs_strongly_typed, 115, 132, 275
+O[no]ptrs_to_globals, 115, 135, 143
+O[no]regreassoc, 115, 136
+O[no]report, 115, 137
+O[no]sharedgra, 115, 138
+O[no]signedpointers, 116, 138
+O[no]size, 45, 116, 138
+O[no]static_prediction, 116, 139
+O[no]vectorize, 116, 139
+O[no]volatile, 116, 140
+O[no]whole_program_mode, 116, 140
+Oinline_budget, 55, 92, 115, 125
+tmtarget, 141
COMMON, 34 
blocks, 18, 147, 237, 282 
statement, 246 
variable, 91, 150 

common subexpression elimination, 42, 43, 135 
compilation, abort, 121 
compile 
reentrant, 201 
time, 44, 126 
compile time, increase, 49 
compiler assumptions, 304 
compiler options 
-C, 291 


-W, 290 

Compiler Parallel Support Library, 309 
compilers 
location of, 25
location of aC++, 25
location of C, 25 
location of Fortran 90, 25 
cond_lock_gate functions, 248
conditional 
blocks, 197 
branches, 139 
constant 
folding, 27 
induction, 28 
contiguous 
cache lines, 18 
code, 197 

control variable, 31 
copy propagation, 135 
core dump, 291 
CPS cache, 335 
cps_barrier_free(), 332 
cps_nsthreads(), 311 
cps_nthreads(), 318 
cps_plevel(), 329 
cps_ppcall(), 318 
cps_ppcalln(), 318 
cps_ppcallv(), 311 
CPS_STACK_SIZE, 202, 317 
cps_stid(), 311, 318 
cps_thread_create(), 329
cps_thread_createn(), 329 
cps_thread_exit(), 329 
cps_thread_register_lock(), 329 
CPSlib, 309 

low-level counter semaphores, 337 
low-level locking functions, 337 
unlock routines, 336 
unmappable functions in pthreads, 318
CPSlib asymmetric functions, 312 
cps_thread_create(), 312 
cps_thread_createn(), 312 
cps_thread_exit(), 312 
cps_thread_register_lock(), 312
cps_thread_wait(), 312 
CPSlib informational functions, 312 
cps_complex_cpus(), 312 
cps_complex_nodes(), 313 
cps_complex_nthreads(), 313 
cps_is_parallel(), 313 
cps_plevel(), 313 
cps_set_threads(), 313 
cps_topology(), 313 
CPSlib symmetric functions, 311 
cps_nsthreads(), 311 
cps_ppcall(), 311 
cps_ppcalln(), 311 
cps_ppcallv(), 311 
cps_stid(), 311 
cps_wait_attr(), 312 

CPSlib synchronization functions, 314, 315 
[mc]_cond_lock(), 315 
[mc]_fetch_and_add32(), 315 
[mc]_fetch_and_clear32(), 315 
[mc]_fetch_and_dec32(), 316 
[mc]_fetch_and_inc32(), 316 
[mc]_fetch_and_set32(), 316 
[mc]_fetch32(), 315 
[mc]_free32(), 315 
[mc]_init32(), 315, 316 
cps_barrier(), 313 
cps_barrier_alloc(), 313 
cps_barrier_free(), 314
cps_limited_spin_mutex_alloc(), 314
cps_mutex_alloc(), 314 
cps_mutex_free(), 314 
cps_mutex_lock(), 314 
cps_mutex_trylock(), 315 
cps_mutex_unlock(), 315 
CPU agent, 10 
create 

temporary variable, 277 
threads, 317 
critical sections, 254 
conditionally lock, 265 
using, 257 


CRITICAL_SECTION directive and pragma, 177, 
189, 255 

example, 190, 257, 258 
cross-module optimization, 53 
cumulative optimizations, 58
cumulative options, 30 
curly brackets, xvii 

D 

data 

alignment, 12, 27, 37, 71, 91 
cache, 7, 12, 58, 69, 119 
dependences, 179, 185, 192, 287 
encached, 13 
exploit cache, 102 
item, 238, 241 
items, different, 279 
layout, 279 

local to procedure, 239 

localization, 28, 58, 59, 64, 69 

multiple dependences, 243

object, 220 

objects (C/C++), 237 

prefetch, 119 

private, 235 

privatizing, 218 

reuse, 12, 13, 71 

segment, 23 

shared, 239 

type statements (C/C++), 245 
types, double, 239 
DATA statement, 235, 282 
data-localized loops, 7 
dead code elimination, 27, 40 
deadlock, detect with pthreads, 335 
deallocating 
barriers, 247 
gates, 245, 257 
deallocation functions, 247 
default stack size, 175, 202 
delete 

cloned procedures, 140 
inlined procedures, 140
dependences, 229 
data, 179, 185, 192, 287 
element-to-element, 62 
ignore, 149 
loop-carried, 287, 292 
multiple, 243 
nonordered, 254 
ordered data, 243 
other loop fusion, 64 
synchronize, 197 
synchronized, 182 
synchronizing, 255 
dereferences of pointers, 143 
DIMENSION statement, 246 
Dipasquale, Mark D., 318 
directives 

BEGIN_TASKS, 94, 177, 192
BLOCK_LOOP, 70, 76, 146, 148 
CRITICAL_SECTION, 177, 189, 254, 257 
DYNSEL, 146, 148 

END_CRITICAL_SECTION, 177, 189, 254
END_ORDERED_SECTION, 255 
END_PARALLEL, 28, 94, 176 
END_TASKS, 94, 177, 192 
LOOP_PARALLEL, 28, 94, 118, 176, 179, 181, 
185 

LOOP_PARALLEL(ORDERED), 253 
LOOP_PRIVATE, 218, 220 
misused, 292 

NEXT_TASK, 94, 177, 192 
NO_BLOCK_LOOP, 70, 146, 148 
NO_DISTRIBUTE, 77, 146, 148
NO_DYNSEL, 146, 149 
NO_LOOP_DEPENDENCE, 60, 63, 149 
NO_LOOP_TRANSFORM, 89, 146, 149 
NO_PARALLEL, 110, 146, 149 
NO_SIDE_EFFECTS, 146, 150
NO_UNROLL_AND_JAM, 85, 146
ORDERED_SECTION, 177, 255 
PARALLEL, 94, 176 
parallel, 28 

PARALLEL_PRIVATE, 218, 229 


PREFER_PARALLEL, 28, 94, 176, 178, 181,
185 

privatizing, 218 
REDUCTION, 146, 177 
SAVE_LAST, 218, 224 
SCALAR, 146 

SYNC_ROUTINE, 146, 177, 250
TASK_PRIVATE, 196, 218, 227 
UNROLL_AND_JAM, 85, 146, 150
disable 

automatic parallelism, 110 
global register allocation, 138 
LCDs, 60 

loop thread parallelization, 191 
division, 40 
DO loops, 178, 220 
DO WHILE loops, 184 
double, 49 
data types, 239 
variable, 130, 290 
dummy 
argument, 246 
arguments, 286 
registers, 40 

dynamic selection, 120, 154, 155 
workload-based, 102, 149 
DYNSEL directive and pragma, 146, 148 

E

element-to-element dependences, 62 
ellipses, vertical, xviii 
encache memory, 20 

END_CRITICAL_SECTION directive and
pragma, 177, 189, 255 
end_parallel, 28 

END_PARALLEL directive and pragma, 28, 94, 
176 

END_TASKS directive and pragma, 94, 177, 192 
enhance performance, 12 
entry code, 40 
environment variables 
and pthreads, 317 
CPS_STACK_SIZE, 202, 317
MP_IDLE_THREADS_WAIT, 100, 317
MP_NUMBER_OF_THREADS, 94, 130, 317 
EQUIVALENCE statement, 64, 274 
equivalencing 
barriers, 246 
gates, 246 

equivalent groups, constructing, 144 
ERRNO, 126 
examining code, 302 
examples 
aliasing, 64 
apparent LCDs, 106 
avoid loop interchange, 63 
branches, 40 
cache padding, 15 
cache thrashing, 13 
common subexpression elimination, 43 
conditionally lock critical sections, 265 
critical sections and gates, 264 
CRITICAL_SECTION, 190, 257 
data alignment, 37 

denoting induction variables in parallel loops, 
222 

gated critical sections, 258 

I/O statements, 67

inlining within one file, 55

inlining within one source file, 55 

interleaving, 20 

loop blocking, 76 

loop distribution, 77 

loop fusion, 80 

loop interchange, 82 

loop peeling, 80 

loop transformations, 97 

loop unrolling, 45, 46 

LOOP_PARALLEL, 187, 188 

LOOP_PARALLEL(ORDERED), 253 

LOOP_PRIVATE, 221 

loop-invariant code motion, 45 

loop-level parallelism, 94 

matrix multiply blocking, 74 

multiple loop entries/exits, 68 


NO_PARALLEL, 110 
node_private, 241 
Optimization Report, 160 
ordered section limitations, 261, 262 
output LCDs, 106 
PARALLEL_PRIVATE, 229 
parallelizing regions, 199 
parallelizing tasks, 195, 196 
PREFER_PARALLEL, 187, 188 
reduction, 109 
SAVE_LAST, 225 
secondary induction variables, 223 
software pipelining, 49 
strength reduction, 52 
strip mining, 54 
SYNC_ROUTINE, 251, 252
TASK_PRIVATE, 227 
test promotion, 90 
thread_private, 238, 239 
thread_private COMMON blocks in parallel
subroutines, 239 
type aliasing, 134 
unroll and jam, 85 
unsafe type cast, 133 
unused definition elimination, 52 
using LOOP_PRIVATE w/LOOP_PARALLEL,
221 

executable files, large, 55, 92 
execution speed, 130 
exit 

code, 40 
statement, 68 

explicit pointer typecast, 144 
exploit data cache, 102 
extern variable, 91 
external, 282 

F 

fabs(), 125 

fall-through instruction, 39 
false cache line sharing, 13, 279 
faster register allocation, 40 
file

level, 89 
scope, 32, 237 
file-level optimization, 28 
fixed ownership of cache lines, 299 
float, 49 

float variable, 130 
floating-point 
calculation, 126 
expression, 289 
imprecision, 289 
instructions, 128 
traps, 128 

floating-point instructions, 41 

flow-sensitive analysis, 277 

flush to zero, 290 

FMA, 121 

folding, 43, 136 

for loop, 178, 220 

force 

arrays to start on cache line boundaries (C), 282 
arrays to start on cache line boundaries 
(Fortran), 282 
parallelization, 176, 179 
reduction, 177 
form of 

alloc_barrier, 247 
alloc_gate, 247 
barrier, 245 
block_loop, 70
cond_lock_gate, 248 
CRITICAL_SECTION, 254 
directive names, 147 
END_CRITICAL_SECTION, 254 
END_ORDERED_SECTION, 255 
free_barrier, 247 
free_gate, 247 
gate, 245 
lock_gate, 248 
LOOP_PRIVATE, 220 
memory class assignments, 236 
no_block_loop, 70 
no_distribute, 77 


no_loop_dependence, 60 
no_loop_transform, 89 
no_unroll_and_jam, 85
ORDERED_SECTION, 255 
PARALLEL_PRIVATE, 229 
pragma names, 147 
reduction, 108 
SAVE_LAST, 225 

SYNC_ROUTINE directive and pragma, 250 
TASK_PRIVATE, 227 
unlock_gate, 249 
unroll_and_jam, 85
Fortran 90 compiler 
guidelines, 34 
location of, 25
free_barrier functions, 247 
free_gate functions, 247 
functions 
alloc_barrier, 247 
alloc_gate, 247 
allocation, 247 
cond_lock_gate, 248 
deallocation, 247 
free_barrier, 247 
free_gate, 247 
lock_gate, 248 
locking, 248 
malloc (C), 13, 282
memory_class_malloc (C), 13, 282
number of processors, 203 
number of threads, 204 
stack memory type, 205 
synchronization, 246 
thread ID, 205 
unlock_gate, 249 
unlocking, 249 
wait_barrier, 249 
functions, CPSlib 
[mc]_cond_lock(), 315 
[mc]_fetch_and_add32(), 315, 338 
[mc]_fetch_and_clear32(), 315, 338 
[mc]_fetch_and_dec32(), 316, 338 
[mc]_fetch_and_inc32(), 316, 338 
[mc]_fetch_and_set3(), 338
[mc]_fetch_and_set32(), 316 
[mc]_fetch32(), 315, 338 
[mc]_free32(), 315 
[mc]_init32(), 315, 316, 337 
asymmetric, 312 
cps_barrier(), 313 
cps_barrier_alloc(), 313 
cps_barrier_free(), 314, 332
cps_complex_cpus(), 312 
cps_complex_nodes(), 313 
cps_complex_nthreads(), 313 
cps_is_parallel(), 313 
cps_limited_spin_mutex_alloc(), 314
cps_mutex_alloc(), 314 
cps_mutex_free(), 314 
cps_mutex_lock(), 314 
cps_mutex_trylock(), 315 
cps_mutex_unlock(), 315 
cps_nthreads(), 318 
cps_plevel(), 313, 329 
cps_ppcall(), 311, 318 
cps_ppcalln(), 311, 318 
cps_ppcallv(), 311 
cps_set_threads(), 313 
cps_stid(), 311, 318 
cps_thread_create(), 312, 329 
cps_thread_createn(), 312, 329 
cps_thread_exit(), 312, 329 
cps_thread_register_lock(), 312, 329 
cps_thread_wait(), 312 
cps_topology(), 313 
cps_wait_attr(), 312 
high-level mutexes, 314 
high-level barriers, 313
informational, 312 
low-level counter semaphores, 315 
low-level locks, 315 
symmetric, 311 
functions, math 
acos, 126 
asin, 126 
atan, 126 


atan2, 126 
cos, 126 
exp, 126 
log, 126 
log10, 126
pow, 126 
sin, 126 
tan, 126 

functions, pthread 
[mc]_unlock(), 315 
pthread_create(), 312 
pthread_exit(), 312 
pthread_join(), 312
pthread_mutex_destroy(), 314 
pthread_mutex_init(), 314, 315, 335 
pthread_mutex_lock(), 314, 315, 335 
pthread_mutex_trylock(), 315, 335 
pthread_mutex_unlock(), 315, 335, 336 
pthread_num_processors_np(), 312, 313, 319 

G

gate variable declaration, 245 
gates, 147, 189, 245 
allocating, 245 
deallocating, 245, 257 
equivalencing, 246 
locking, 245 
unlocking, 245 
user-defined, 257 
global 
arrays, 282 
optimization, 91 
pointers, 122 

register allocation, 37, 42, 43, 138 
variables, 32, 135, 140, 277 
GOTO statement, 39, 67, 68 
GRA, 37, 42, 43, 138 
guidelines 
aC++, 31, 32 
C, 31, 32 
coding, 32 
Fortran 90, 31, 34 
H

hardware history mechanism, 139 
header file, 124, 236 
hidden 
aliases, 276 
ordered sections, 292 
horizontal ellipses, xviii 
HP MPI, 4 

HP MPI User's Guide, 5, 111 

HP-UX Floating-Point Guide, 126, 139, 290 

hypernode, V2250, 11 

I 

I/O statement, 67 
idle 

CPS threads, 317 
threads, 100 

increase replication limit, 87 
incrementing by zero, 304 
induction 
constants, 28 
variables, 28, 196 
induction variables, 51, 222, 276 
in region privatization, 230 
information, parallel, 203 
inhibit 

data localization, 59 
fusion, 79 
localization, 68, 69 
loop blocking, 76 
loop interchange, 60, 179 
parallelization, 274 
inlined calls, 154, 155 
inlined procedures 
delete, 140 
inlining, 124 
across multiple files, 92 
aggressive, 125 
at +O3, 92
at +O4, 92
default level, 125 
within one source file, 55 


inner-loop memory accesses, 82 
instruction 
fall-through, 39 
scheduler, 27, 41 
scheduling, 39, 120 
integer arithmetic operations, 136 
interchange, loop, 63, 68, 77, 82, 90 
interleaving, 17, 18, 19, 20 
interprocedural optimization, 57 
invalid subscripts, 273, 291 
italic, xvii 
iteration 

distribution, controlling, 281 
distribution, default, 283 
stop values, 275 
iterations, consecutive, 253 

K 

K-Class servers, 9, 235 
kernel parameter, 202 
kernel parameters, 23 

L 

large trip counts, 307 
LCDs, 59, 287, 292 
disable, 60 
output, 106 
levels 
block, 27 
optimization, 307 
library calls 
alloca(), 125
fabs(), 125 
sqrt(), 125 
strcpy(), 125
library routines, 126 
limitations, ordered sections, 261, 262 
linear 

functions, 51 
test replacement, 305 
lint, 32 

local variables, 32, 218 
localization, data, 28, 58
location of compilers, 25 
lock_gate functions, 248
locking 

functions, 248 
gates, 245 
locks, low-level, 315 
log math function, 126 
logical expression, 36 
loop, 225 
arrays, 69 
blocked, 70

blocking, 28, 54, 58, 70, 76, 79, 82, 85, 89, 127, 
154, 155 

blocking, inhibit, 76 
branch destination, 67 
counter, 276 
customized, 222 
dependence, 149 
disjoint, 99 

distribution, 28, 58, 70, 79, 82, 85, 89, 127, 154,
155 

distribution, disable, 148 
entries, extra, 68 
entries, multiple, 59 
fused, 157, 162 

fusion, 28, 58, 70, 79, 80, 82, 89, 127, 155 
fusion dependences, 59, 64 
induction, 181 
induction variable, 196 

interchange, 28, 58, 67, 68, 69, 70, 76, 77, 79, 
82, 85, 89, 90, 154, 155 
interchange, avoid, 63 
interchange, inhibit, 60, 179 
interchanges, 150 
invocation, 185 
iterations, 279 
jamming, 128 
multiple entries in, 68
nest, 45, 76 
nested, 20, 84, 85 
nests, 153 
number of, 104 


optimization, 53 
optimize, 149 
overhead, eliminating, 128 
parallelizing, 222 
peeled iteration of, 80 
peeling, 80, 155 
preventing, 28 
promotion, 155 
reduction, 157 
relocate, 82 
removing, 157 
reordering, 28, 89 
replication, 45, 58 
restrict execution, 182 
serial, 20, 183 
source line of, 159 
strip length, 54 
table, 159 

thread parallelization, 191 
transformations, 7, 58, 82, 97 
unroll, 45, 79, 82, 84, 89, 127 
unroll and jam, 28, 54, 58, 79, 82, 84, 89, 127, 
154, 155 

unroll factors, 87 
unroll_and_jam, 70
unrolling, 42, 45, 46, 58, 128 
Loop Report, 137, 151, 153, 159 
loop unrolling example, 45 
loop, strip, 72 
LOOP_PARALLEL, 181 
loop_parallel, 28 

LOOP_PARALLEL directive and pragma, 28, 94,
118, 129, 176, 179, 185 
example, 187, 188, 222 
LOOP_PARALLEL(ORDERED) directive and
pragma, 253, 295 
example, 253 

LOOP_PRIVATE directive and pragma, 218, 220
arrays, 225 
example, 221 

loop-carried dependences, 59, 60, 287, 292 
loop-invariant, 46 
code, 42, 45 
code motion, 136
loop-iteration count, 102 
loops 

adjacent, 80 
constructing, 31 
data-localized, 7 
DO (Fortran), 178, 220 
DO WHILE (Fortran), 184 
exploit parallel code, 254 
for (C), 178, 220 
fusable, 79 
fusing, 150 

induction variables in parallel, 222 

multiple entries, 68 

neighboring, 79 

number of parallelizable, 79 

parallelizing, 175 

parallelizing inner, 298 

parallelizing outer, 298 

privatization for, 159 

privatizing, 217 

reducing, 79 

replicated, 90 

safely parallelizing, 275 

simple, 102 

that manipulate variables, 217 
triangular, 188, 296 
unparallelizable, 180 
loop-variant, 46 
low-level 

counter semaphores, 315, 337 
LSIZE, 286 

M 

machine 

instruction optimization, 27 
instructions, 84 
loading, 96 
MACS, 10 
malloc, 12, 282 
man pages, xviii 

Managing Systems and Workgroups, 202 


manual 

parallelization, 179, 218 
synchronization, 218, 264 
map-coloring, 44 
Mark D. Dipasquale, 318 
math functions, 126 
matrix multiply blocking, 74 
memory 
banks, 10 
encached, 20 
hypernode local, 233 
inner-loop access, 82 
layout scheme, 33 
mapping, 34 
overlap, 130 
physical, 17 
references, 140 
semaphores, 335 
space, occupying same, 275 
usage, 126 
virtual, 18, 46 

Memory Access Controllers, 10
memory class, 218, 238 
assignments, 236 
declarations (C/C++), 236 
declarations (Fortran), 236 
misused, 273 

node_private, 233, 235, 241 
thread_private, 233, 235 
memory_class_malloc, 12, 13, 282 
message-passing, 4 
minimum page size, 23 
misused 

directives and pragmas, 292 
memory classes, 273 
monospace, xvii 

MP_IDLE_THREADS_WAIT, 100, 317
MP_NUMBER_OF_THREADS, 94, 130, 317
MPI, 4

multinode servers, 309
multiple 

data dependences, 243 
entries in loop, 68 
exits, 69

multiplication, 40 
mutexes, 332, 335 
high-level, 314 

N 

natural boundaries, 37 
nested 
loop, 20 

parallelism, 244 

NEXT_TASK directive and pragma, 94, 177, 192 
NO_BLOCK_LOOP directive and pragma, 70, 
146, 148 

NO_DISTRIBUTE directive and pragma, 77, 146,
148

NO_DYNSEL directive and pragma, 146, 149
NO_LOOP_DEPENDENCE directive and
pragma, 60, 63, 149, 294
directives

NO_LOOP_DEPENDENCE, 146
NO_LOOP_TRANSFORM directive and pragma,
89, 146, 149

NO_PARALLEL directive and pragma, 110, 146,
149

NO_SIDE_EFFECTS directive and pragma, 146,
150

NO_UNROLL_AND_JAM directive and pragma,
85, 146

NO_UNROLL_JAM directive and pragma, 84
node_private, 111
example, 241 

static assignment of, 238, 241 
virtual memory class, 233, 235 
nondeterminism of parallel execution, 292, 295 
nonordered
dependences, 254 
manipulations, 177 
nonstatic variables, 34, 123 
Norton, Scott, 318 
notational conventions, xvii 
number of 
processors, 129, 203 


threads, 204 

O 

O, 143 

objects, stack-based, 237 
offset indexes, 286 
OpenMP, 208 

Command-line Options, 209 
default, 209 
defined, 208 

effect on HPPM directives, 212
More information, 215 
syntax, 211 
www.openmp.org, 215 
operands, 36 
optimization, 27 
+O0, 27
+O1, 27
+O2, 28, 40, 58

+O3, 28, 55, 57, 58, 70, 77, 79, 82, 84, 89

+O4, 55, 57

aliasing, 64 

block-level, 27, 39 

branch, 27, 39 

cloning within one file, 57

command-line options, 27, 93 

cross-module, 53, 91 

cumulative, 58 

data localization, 58, 69 

dead code, 39 

directives, 113 

faster register allocation, 39 

features, 27, 35, 53 

file-level, 28 

FMA, 122

global, 91 

I/O statements, 67

inlining across multiple files, 92

inlining within one file, 55

interprocedural, 57, 112 

levels, 25, 274, 307 

loop, 53 
loop blocking, 70
loop distribution, 77 
loop fusion, 79 
loop interchange, 82 
loop unroll and jam, 84 
multiple loop entries, 68 
multiple loop exits, 68 
options, 113 
peephole, 27, 39, 41 
pragmas, 113 
routine-level, 28, 42 
static variable, 91 
store/copy, 27 
strip mining, 54 
test promotion, 90 
unit-level, 6 
using, 31 
valid options, 114 

Optimization Report, 85, 90, 151, 158, 183 
contents, 137 
Optimization Reports, 275 
optimizations 
advanced, 7 
advanced scalar, 7 
aggressive, 118 
architecture-specific, 141 
floating-point, 121 
increase code size, 138 
loop reordering, 89 
scalar, 6, 7 
suppress, 138 
that replicate code, 87 
optimize 

instruction scheduling, 120 
large programs, 139 
loop, 149 
ordered 

data dependences, 243 
parallelism, 194, 253 
sections, 255 
ordered sections 
hidden, 292 
limitations of, 261, 262 
using, 259

ORDERED_SECTION directive and pragma, 177,
255 

output LCDs, 106 
overflowing trip counts, 305 
overlap, memory, 130 

P

PA-8200, 23 

page size, minimum, 23 

parallel 

assignments, 44 
command-line options, 93 
construct, 254 
executables, 12 
execution, 295 

information functions, 175, 203 

programming, 9 

programming techniques, 175 

regions, 176 

structure, 244 

tasks, 177 

threads, 138 

PARALLEL directive and pragma, 94, 176
PARALLEL_PRIVATE directive and pragma,
218, 229 
example, 229 
parallelism, 30, 110 
asymmetric, 329 
automatic, 94 
in aC++, 111 
inhibit, 28 
levels of, 94
loop level, 94 
nested, 244 
ordered, 194, 253 
region level, 94 
stride-based, 186 
strip-based, 99, 186 
task level, 94 
thread, 244 
unordered, 193 



parallelization, 28, 54 
force, 176, 179 
in aC++, 28
increase, 178 
inhibit, 274 
manual, 179, 218 
overhead, 299 
prevent, 28 
preventing, 110 
parallelizing 
code outside a loop, 192 
consecutive code blocks, 177 
inner loops, 298 
loop, 222 
loops, safely, 275 
next loops, 178 
outer loops, 298 
regions, 197 
tasks, 192 
threads, 183, 191 
parameters, kernel, 23 
partial evaluation, 36 
PCI bus controller, 10 
peephole optimization, 27, 41 
performance 
enhance, 12 

shared-memory programs, 218 
physical memory, 17 
pipelining, 41 
prerequisites, 49 
software, 49 
pointers, 32 
C, 274 

dereferences, 143 
strongly-typed, 132 
type-safe, 132 
using as loop counter, 276 
poor locality, 139 
porting 

CPSlib functions to pthreads, 309 
multinode applications, 235
X-Class to K-Class, 234 
X-Class to V-Class, 234 


POSIX threads, 111, 309 
potential alias, 275 
pow math function, 126 
pragmas 

begin_tasks, 94, 177, 192 
block_loop, 70, 76, 146, 148
critical_section, 177, 189, 254 
critical_section, 257
dynsel, 146, 148 

end_critical_section, 177, 189, 254 

end_ordered_section, 255 

end_parallel, 28, 94, 176 

end_tasks, 94, 177, 192 

loop_parallel, 28, 94, 118, 176, 179, 181, 185 

loop_parallel(ordered), 253 

loop_private, 218, 220 

misused, 292 

next_task, 94, 177, 192 

no_block_loop, 70, 146, 148 

no_distribute, 146, 148 

no_dynsel, 146, 149 

no_loop_dependence, 60, 146, 149 

no_loop_transform, 89, 146, 149 

no_parallel, 110, 146, 149 

no_side_effects, 146, 150 

no_unroll_and_jam, 85, 146

ordered_section, 177, 255 

parallel, 28, 94, 176 

parallel_private, 218, 229 

prefer_parallel, 28, 94, 176, 178, 181, 185 

privatizing, 218 

reduction, 146, 177 

save_last, 218, 224

scalar, 146 

sync_routine, 44, 146, 177, 250 
task_private, 196, 218, 227 
unroll_and_jam, 85, 146, 150
prefer_parallel, 182 

PREFER_PARALLEL directive and pragma, 28, 
94, 129, 176, 178, 181, 185 
example, 187, 188 
prevent 

loop interchange, 67 
parallel code, 149
parallelism, 110 
primary induction variable, 184 
private data, 235 
privatization 
data, 185 
variable, 159 

Privatization Table, 137, 152, 159 
privatizing 
directives, 218 
loop data, 220 
loops, 159 
parallel loops, 218 
pragmas, 218 
regions, 218, 229 
tasks, 218, 227 
variables, 218 
procedure calls, 59, 274 
procedures, 6 
processors 
number of, 203 
specify number of, 129 
program 
behavior, 120 
overhead, 255, 256, 299 
units, 6 

programming models 
message-passing, 4 
shared-memory, 3 
programming parallel, 9 
propagation, 43 
prototype definition, 125 
pthread 

mutex functions, 335 
mutexes, 337 

pthread asymmetric functions 
pthread_create(), 312
pthread_exit(), 312 
pthread_join(), 312 
pthread informational functions 
pthread_num_processors_np(), 312, 313 
pthread synchronization functions 
[mc]_unlock(), 315 
pthread_mutex_destroy(), 314
pthread_mutex_init(), 314, 315 
pthread_mutex_lock(), 314, 315 
pthread_mutex_trylock(), 315 
pthread_mutex_unlock(), 315 
pthread.h, 310 
pthread_mutex_init(), 335 
pthread_mutex_lock(), 335 
pthread_mutex_trylock(), 335 
pthread_mutex_unlock(), 335, 336 
pthreads, 111, 309 
accessing, 309, 310 
and environment variables, 317 

R

REAL variable, 130 
REAL*8 variable, 130, 290 
reduction 
examples, 109 
force, 177 
form of, 108 
loop, 157 

REDUCTION directive and pragma, 146, 177 

reductions, 28, 289, 292, 294 

reentrant compilation, 175, 201 

region privatization, induction variables in, 230 

regions 

parallelizing, 175, 197 
parallelizing, example, 199 
privatizing, 217, 229 
register 
allocation, 44 
allocation, disable, 138 
exploitation, 128 
increase exploitation of, 84 
reassociation, 46 
usage, 79 
use, improved, 128 
registers, 27, 51 
global allocation, 37, 42, 43 
simple alignment, 37
reordering, 154 



replicate code, 87 
replication limit, increase, 87 
report_type, 137, 152
report_type values
all, 152 
loop, 152 
none, 152 
private, 152 

RETURN statement, 59, 68 
return statement, 59, 68 
reuse 

spatial, 71, 74 
temporal, 71, 74, 84 
routine-level optimization, 28, 42 
routines 

user-defined, 250 
vector, 139 
rules 

ANSI standard, 273 
scoping, 241 

S 

SAVE variable, 91 

SAVE_LAST directive and pragma, 218, 224 
example, 225 
scalar 
code, 197 

optimizations, 6, 7 
variables, 43, 285 
SCALAR directive and pragma, 146 
scheduler, instruction, 41 
scope of this manual, xvi 
scoping rules, 241 
Scott Norton, 318 
secondary induction variables, 223 
example, 223 
semaphores 
binary, 335 
low-level, 315 
low-level counter, 337 
serial 

function, 20 


loop, 183 
servers 

K-Class, 9, 141 
V2250, 9, 141 
V-Class, 9, 141 
shared 
data, 4 
variable, 177 
shared-memory, 3 

shared-memory programs, optimize, 233 
short, 32 

short-circuiting, 36 
signed/unsigned type distinctions, 144 
simple loops, 102 
sin math function, 126 
single-node servers 
porting multinode apps to, 235
SMP 

architecture, 1, 2 

software pipelining, 27, 42, 49, 130, 136 
space, virtual address, 17 
spatial reuse, 71, 74 
spawn 

parallel processes, 4 
thread ID, 96 
threads, 218 
speed, execution, 130 
spin 

suspend, 317 
wait, 317 

spp_prog_model.h, 203, 236 

sqrt(), 125 

stack 

memory type, 205 
size, default, 202 
stack-based objects, 237 
statements 

ALLOCATE (Fortran), 13, 282 
COMMON (Fortran), 246 
DATA (Fortran), 235, 282 
DIMENSION (Fortran), 246 
EQUIVALENCE (Fortran), 64, 274 
exit (C/C++), 68 


GOTO (Fortran), 67, 68 
I/O (Fortran), 67 
return (C/C++), 59, 68 
RETURN (Fortran), 59, 68 
stop (C/C++), 59 
STOP (Fortran), 59, 68 
throw (C++), 69 
type, 246 
static 

variables, 34, 91 
static assignments 
node_private, 238, 241 
thread_private, 238 
STOP statement, 59, 68 
stop statement, 59 
stop variables, 277 
storage class, 237 
external, 282 
storage location 
of global data, 91 
of static data, 91 
strcpy(), 125

strength reduction, 27, 51, 136 
stride-based parallelism, 186 
strip mining, 54, 97 
example, 54 
length, 72 

strip-based parallelism, 99, 186 
strip-mining, 7 
strlen(), 125

strongly-typed pointers, 132 
structs, 32, 282 
structure type, 144 
subroutine call, 155 
sudden underflow, enabling, 290 
sum operations, 109 
suppress optimizations, 138 
suspend wait, 317 
sync_routine, 44, 250 

SYNC_ROUTINE directive and pragma, 146, 177
example, 251, 252 
synchronization 
functions, 246 


intrinsics, 253 
manual, 218, 264 
using high-level barriers, 313 
using high-level mutexes, 314 
using low-level counter semaphores, 315 
synchronize 
code, 257 
dependences, 197 
symmetrically parallel code, 332 
syntax 

OpenMP, 211 
syntax extensions, 236 
syntax, command, xviii 

T

tan math function, 126 

TASK_PRIVATE directive and pragma, 196, 218,
227 

example, 227 
tasks 

parallelizing, 175, 177, 192 
parallelizing, example, 195, 196 
privatizing, 217, 227 
Tbyte, 4 

temporal reuse, 71, 74, 84 

terabyte, 4 

test 

conditions, 27 
promotion, 28, 90, 154 
text segment, 23 
THEN clause, 39 
thrashing, cache, 298 
thread, 148 
affinity, 100 
ID, 205, 244 
ID assignments, 244 
idle, 96 
noidle, 96 
spawn ID, 96 
stack, 205 
suspended, 100 
waking a, 100 
thread_private, 111
example, 238, 239 
static assignment of, 238 
virtual memory class, 233, 235 
thread_trip_count, 104 
thread-parallel construct, 244 
threads, 96 
child, 317 
create, 317 
idle, 100, 317 
number of, 204 
parallelizing, 183, 191 
spawn parallel, 102 
spawned, 218 

thread-specific array elements, 284 
Threadtime, 318 
threshold iteration counts, 104 
throw statement, 59, 69 
time, 118 

transformations, 39 
loop, 97 

reordering, 149 
triangular loops, 188, 296 
trip counts 
large, 307 
overflowing, 305 
type 

aliasing, 134, 136 
casting, 132 

names, synonymous, 144 
specifier, 237 
statements, 246 
union, 144 
type-checking, 274 
type-incompatible assignments, 145 
type-inferred aliasing rules, 143 
type-safe 
algorithm, 274 
pointers, 132 


U 

unaligned arrays, 286 


uninitialized variables, 123 
union type, 144 
unlock_gate function, 249
unlocking 
functions, 249 
gates, 245 

unordered parallelism, 193 
unparallelizable loops, 180
Unroll and Jam, 156
unroll and jam, 28 
automatic, 128 
directive-specified, 128 
unroll factors, 46, 87 

UNROLL_AND_JAM directive and pragma, 85,
146, 150 

unrolling, excessive, 87 
unsafe type cast, 133 
unused definition elimination, 52 
using 

a pointer as a loop counter, 276 
critical sections, 257 
hidden aliases as pointers, 276 
ordered sections, 259 

V

V2250 servers, 9, 71, 141, 233 
chunk size, 303 
hypernode overview, 11 
valid page sizes, 23 
variables 
accumulator, 289 
char, 32 

COMMON (Fortran), 34, 91, 150 

create temporary, 277 

double (C/C++), 130, 290 

extern (C/C++), 91 

float (C/C++), 130 

global, 32, 135, 140, 277 

induction, 28, 45, 222, 230, 276 

iteration, 45 

local, 31, 32, 34, 218 

loop induction, 181 


nonstatic, 34, 123 

primary induction, 184 

privatizing, 159, 185, 218 

REAL (Fortran), 130 

REAL*8 (Fortran), 130, 290 

register, 32 

SAVE (Fortran), 91 

scalar, 37, 43, 285 

secondary induction, 223 

secondary induction, example, 223 

shared, 177, 235 

shared-memory, 138 

short, 32 

static, 34, 123 

static (C/C++), 91 

stop, 277 

uninitialized, 123 

values of, 36 

V-Class Architecture manual, 9 
V-Class servers, 9, 235
hypernode overview, 11
vector routines, 139, 140 
vertical ellipses, xviii 
virtual 

address space, 17 
memory, 18 
memory address, 46 
volatile attribute, 33 
vps_ceiling, 23 
vps_chatr_ceiling, 23 
vps_pagesize, 23 

W 

-W compiler option, 290 
wait_barrier functions, 249 
workload-based dynamic selection, 102, 149 

X 

X-class, 234 